Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 87]
- cs.CV [Total: 108]
- cs.AI [Total: 75]
- cs.SD [Total: 19]
- cs.LG [Total: 107]
- cs.MA [Total: 4]
- cs.MM [Total: 1]
- eess.AS [Total: 11]
- eess.IV [Total: 11]
cs.CL
[1] Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration
Longxuan Wei, Yubo Zhang, Zijiao Zhang, Zhihu Wang, Shiwan Zhao, Tianyu Huang, Huiting Zhao, Chenfei Liu, Shenao Zhang, Junchi Yan
Main category: cs.CL
TL;DR: Entropy-Tree is a tree-based decoding method that uses entropy to guide branching decisions, achieving better reasoning performance and uncertainty calibration than existing approaches.
Details
Motivation: Current decoding strategies for large language models either explore blindly (random sampling) or redundantly (independent multi-sampling), lacking efficient structured exploration and reliable uncertainty estimation.
Method: Proposes Entropy-Tree, a tree-based decoding method that uses entropy as a signal for branching decisions, expanding the search tree only at positions where the model exhibits genuine uncertainty.
Result: Achieves better pass@k than Multi-chain across multiple models and datasets, and its predictive entropy demonstrates better AUROC compared to several traditional uncertainty metrics.
Conclusion: Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure, offering superior accuracy and calibration in reasoning tasks.
Abstract: Large language models achieve strong reasoning performance, yet existing decoding strategies either explore blindly (random sampling) or redundantly (independent multi-sampling). We propose Entropy-Tree, a tree-based decoding method that exploits entropy as a signal for branching decisions, expanding the search tree only at positions where the model exhibits genuine uncertainty. Entropy-Tree shows superior accuracy and calibration in reasoning tasks: it achieves better pass@k than Multi-chain across multiple models and datasets, and its predictive entropy demonstrates better AUROC compared to several traditional metrics. Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure.
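The branching rule described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the threshold value, the `should_branch` and `top_candidates` helpers, and the toy distributions are all assumptions made for the example.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(probs, threshold=1.0):
    """Branch only where the model is uncertain (the threshold is hypothetical)."""
    return token_entropy(probs) > threshold

def top_candidates(probs, k=2):
    """Indices of the k most probable tokens, used as child nodes when branching."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Toy distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # low entropy: follow the single best token
uncertain = [0.25, 0.25, 0.25, 0.25]  # high entropy: expand the tree here

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    if should_branch(dist):
        print(name, "-> branch into tokens", top_candidates(dist))
    else:
        print(name, "-> continue with token", top_candidates(dist, k=1)[0])
```

Under this sketch the tree grows only at high-entropy positions, which is what separates the approach from sampling several full chains independently.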
[2] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports
Edward Ajayi
Main category: cs.CL
TL;DR: AfriEconQA is a specialized benchmark dataset for African economic analysis using World Bank reports, featuring 8,937 QA instances requiring numerical reasoning and temporal disambiguation, revealing severe knowledge gaps in current LLMs.
Details
Motivation: There's a lack of specialized benchmarks for African economic analysis, and current LLMs have limited knowledge about African economic data since it's largely absent from their pretraining corpora.
Method: Created a dataset from 236 World Bank reports, curated 8,937 QA instances from 10,018 synthetic questions, with each instance containing question, evidence, verified answer, and source metadata. Benchmarked using an 11-experiment matrix comparing zero-shot GPT-5 Mini against RAG configurations with GPT-4o and Qwen 32B across five embedding/ranking strategies.
Result: Zero-shot models failed to answer over 90% of queries, and even state-of-the-art RAG pipelines struggled to achieve high precision, demonstrating a severe parametric knowledge gap.
Conclusion: AfriEconQA is a robust and challenging benchmark for domain-specific IR and RAG systems, highlighting the need for specialized approaches to handle African economic data that’s underrepresented in current LLM training.
Abstract: We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
[3] Embedding Retrofitting: Data Engineering for better RAG
Anantha Sharma
Main category: cs.CL
TL;DR: Retrofitting word embeddings with knowledge graphs improves domain retrieval, but annotation artifacts (like hashtags) corrupt graphs and degrade performance. Proper preprocessing is more critical than algorithm choice for success.
Details
Motivation: Knowledge graph quality is crucial for embedding retrofitting, but real-world corpora contain annotation artifacts that degrade data quality and corrupt the retrofitting objective.
Method: A data engineering framework that addresses data quality degradation from annotation artifacts, analyzing how hashtag annotations inflate knowledge graph density and create spurious edges.
Result: On noisy graphs, all retrofitting techniques degrade performance (-3.5% to -5.2%, p<0.05). After preprocessing, EWMA retrofitting achieves +6.2% improvement (p=0.0348), with +33.8% average improvement on quantitative synthesis questions.
Conclusion: Preprocessing quality is the primary determinant of retrofitting success: the gap between clean and noisy preprocessing (10%+ swing) exceeds the gap between algorithms (3%).
Abstract: Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation ($-3.5\%$ to $-5.2\%$, $p<0.05$). After preprocessing, EWMA retrofitting achieves a $+6.2\%$ improvement ($p=0.0348$), with benefits concentrated in quantitative synthesis questions ($+33.8\%$ average). The gap between clean and noisy preprocessing (10%+ swing) exceeds the gap between algorithms (3%), establishing preprocessing quality as the primary determinant of retrofitting success.
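To see why a spurious edge matters, the sketch below runs a plain neighbour-averaging retrofit on a toy graph with and without one artifact edge. It does not reproduce the paper's EWMA variant: the update rule (equal weight on the original vector and the neighbour mean), the 2-d vectors, and the words are all assumptions chosen for illustration.

```python
def retrofit(vectors, edges, iterations=20):
    """Pull each vector toward its graph neighbours while staying near the original."""
    neighbours = {word: set() for word in vectors}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    updated = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, nbrs in neighbours.items():
            if not nbrs:
                continue  # isolated words keep their original vector
            dim = len(vectors[word])
            # Average of the current neighbour vectors.
            avg = [sum(updated[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
            # Equal weight on the original vector and the neighbour average.
            updated[word] = [(vectors[word][d] + avg[d]) / 2 for d in range(dim)]
    return updated

# Hypothetical 2-d embeddings; "#promo" stands in for a hashtag artifact.
vectors = {"inflation": [1.0, 0.0], "prices": [0.8, 0.2], "#promo": [0.0, 1.0]}
clean_edges = [("inflation", "prices")]
noisy_edges = clean_edges + [("inflation", "#promo")]  # spurious edge

clean = retrofit(vectors, clean_edges)
noisy = retrofit(vectors, noisy_edges)
print("clean:", clean["inflation"])
print("noisy:", noisy["inflation"])
```

In the noisy run the vector for "inflation" is dragged toward the unrelated hashtag node, which is the kind of corruption the preprocessing step is meant to prevent.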
[4] MALTopic: Multi-Agent LLM Topic Modeling Framework
Yash Sharma
Main category: cs.CL
TL;DR: MALTopic is a multi-agent LLM framework that improves topic modeling for survey data by integrating structured responses and generating human-readable topics.
Details
Motivation: Traditional topic modeling methods have limitations: they only use free-text responses, ignore structured/categorical data, and produce abstract topics requiring extensive human interpretation.
Method: Multi-Agent LLM Topic Modeling Framework (MALTopic) with three specialized agents: enrichment agent (uses structured data to enhance text), topic modeling agent (extracts latent themes), and deduplication agent (refines results).
Result: MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic on a survey dataset, generating human-readable topics with enhanced contextual relevance.
Conclusion: MALTopic offers a more effective solution for analyzing complex survey data by integrating structured data and employing a multi-agent LLM approach.
Abstract: Topic modeling is a crucial technique for extracting latent themes from unstructured text data, particularly valuable in analyzing survey responses. However, traditional methods often only consider free-text responses and do not natively incorporate structured or categorical survey responses for topic modeling. They also produce abstract topics that require extensive human interpretation. To address these limitations, we propose the Multi-Agent LLM Topic Modeling Framework (MALTopic). This framework decomposes topic modeling into specialized tasks executed by individual LLM agents: an enrichment agent leverages structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the results. Comparative analysis on a survey dataset demonstrates that MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic. By integrating structured data and employing a multi-agent approach, MALTopic generates human-readable topics with enhanced contextual relevance, offering a more effective solution for analyzing complex survey data.
[5] Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis
Weiwei Wang, Jiyong Min, Weijie Zou
Main category: cs.CL
TL;DR: LLMs show catastrophic performance drops at specific context length thresholds, termed “intelligence degradation,” with Qwen2.5-7B collapsing at 40-50% of max context length.
Details
Motivation: Large Language Models exhibit severe performance degradation when processing contexts near critical thresholds, limiting their practical use in long-context applications despite information remaining relevant.
Method: Three-part approach: (1) Natural Length Distribution Analysis using untruncated samples, (2) Critical Threshold Determination via mixed dataset experiments with 5-method cross-validation, (3) Unified Framework consolidating shallow adaptation theory.
Result: Identified critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), demonstrating catastrophic collapse pattern.
Conclusion: First systematic characterization of intelligence degradation in open-source Qwen models, providing practical guidance for deployment and foundation for mitigation strategies in long-context scenarios.
Abstract: Large Language Models (LLMs) exhibit catastrophic performance degradation when processing contexts approaching certain critical thresholds, even when information remains relevant. This intelligence degradation, defined as a drop of over 30% in task performance, severely limits long-context applications. This degradation shows a common pattern: models maintain strong performance up to a critical threshold, then collapse catastrophically. We term this shallow long-context adaptation: models adapt for short to medium contexts but fail beyond critical thresholds. This paper presents three contributions: (1) Natural Length Distribution Analysis: We use each sample’s natural token length without truncation or padding, providing stronger causal evidence that degradation results from context length itself. (2) Critical Threshold Determination: Through experiments on a mixed dataset (1,000 samples covering 5%-95% of context length), we identify the critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), using five-method cross-validation. (3) Unified Framework: We consolidate shallow adaptation, explaining degradation patterns and providing a foundation for mitigation strategies. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models, offering practical guidance for deploying LLMs in long-context scenarios.
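The 45.5% figure is consistent with a relative drop computed from the lower end of the reported F1 range. A short check, assuming the degradation is measured relative to the pre-threshold score (the helper name is illustrative):

```python
def relative_drop(before, after):
    """Fractional performance loss relative to the starting score."""
    return (before - after) / before

# F1 values quoted in the abstract (lower end of the 0.55-0.56 range).
drop = relative_drop(0.55, 0.30)
print(f"{drop:.1%}")  # prints 45.5%

# The abstract defines intelligence degradation as a drop of more than 30%.
print(drop > 0.30)  # prints True
```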
[6] Can We Trust LLM Detectors?
Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
Main category: cs.CL
TL;DR: Current AI text detectors fail outside controlled benchmarks; both training-free and supervised methods are brittle under distribution shifts, unseen generators, and stylistic perturbations.
Details
Motivation: The rapid adoption of LLMs has increased the need for reliable AI text detection, but existing detectors often fail outside controlled benchmarks, exposing a critical gap in practical deployment.
Method: Systematically evaluate two dominant paradigms (training-free and supervised), then propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings to address limitations.
Result: Supervised detectors excel in-domain but degrade sharply out-of-domain; training-free methods remain highly sensitive to proxy choice; both paradigms show brittleness under distribution shift and stylistic perturbations.
Conclusion: Results expose fundamental challenges in building domain-agnostic detectors, highlighting the need for more robust approaches beyond current paradigms.
Abstract: The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
[7] ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation
Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han
Main category: cs.CL
TL;DR: ICPO is a training framework that helps LLMs recognize ambiguous instructions and respond with appropriate uncertainty or clarification requests, solving the “lost-in-conversation” problem where models get stuck on early incorrect assumptions.
Details
Motivation: LLMs in multi-turn conversations suffer from "lost-in-conversation": they struggle to recover from early incorrect assumptions when users provide ambiguous initial instructions. Standard post-training techniques like RLVR worsen this by rewarding confident answers, causing overconfidence and discouraging clarification-seeking behavior.
Method: Illocution-Calibrated Policy Optimization (ICPO) sensitizes models to instruction ambiguity by augmenting the training corpus with underspecified prompts and conditioning reward signals on the user’s illocutionary intent. It rewards models for expressing uncertainty or asking for clarification when faced with ambiguity.
Result: ICPO yields a substantial average improvement of 75% in multi-turn conversation performance while preserving robust performance on single-turn benchmarks. It fosters appropriate humility in models.
Conclusion: ICPO presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction by helping models recognize when they need clarification rather than confidently proceeding with incorrect assumptions.
Abstract: Large Language Models (LLMs) in multi-turn conversations often suffer from a “lost-in-conversation” phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user’s illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.
[8] RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models
Rishit Chugh
Main category: cs.CL
TL;DR: A resource-efficient adversarial prompting method that matches new prompts to pre-trained adversarial prompts from a database, eliminating retraining needs while maintaining competitive attack success rates.
Details
Motivation: Current automated jailbreaking methods (GCG, PEZ, GBDA) for LLMs are computationally expensive, limiting practicality for organizations with constrained resources, despite security concerns about harmful outputs from adversarial prompts.
Method: Classified 1,000 prompts into 7 harm-related categories, evaluated GCG, PEZ, and GBDA on Llama 3 8B to identify most effective attack per category, then proposed retrieval-based approach that matches new prompts to database of pre-trained adversarial prompts.
Result: Found correlation between prompt type and algorithm effectiveness; retrieval-based method achieves competitive attack success rates with significantly reduced computational cost compared to traditional training/gradient-based methods.
Conclusion: Provides practical framework for scalable red-teaming and security evaluation of aligned LLMs, especially useful in settings where model internals are inaccessible, offering resource-efficient adversarial prompting.
Abstract: The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient-based search. Although effective, these methods, particularly GCG, are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource-efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre-trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm-related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red-teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.
[9] No Reliable Evidence of Self-Reported Sentience in Small Large Language Models
Caspar Kaiser, Sean Enderby
Main category: cs.CL
TL;DR: Language models consistently deny being sentient when asked directly, and classifiers trained on internal activations give no clear evidence that these denials are untruthful, in contrast to previous claims about latent consciousness beliefs.
Details
Motivation: To empirically test whether language models believe themselves to be sentient, moving beyond philosophical debates about actual sentience to measurable beliefs about consciousness.
Method: Tested multiple open-weights models (Qwen, Llama, GPT-OSS, 0.6B-70B parameters) with ~50 consciousness questions, using three classification methods on internal activations to distinguish surface outputs from underlying beliefs.
Result: Models consistently deny sentience (attributing consciousness to humans but not to themselves), classifiers on internal activations provide no clear evidence that the denials are untruthful, and larger Qwen models deny more confidently than smaller ones.
Conclusion: Language models don’t show evidence of believing they’re sentient, in contrast to recent claims about latent consciousness beliefs; within the Qwen family, denial confidence increases with model size.
Abstract: Whether language models possess sentience has no empirical answer. But whether they believe themselves to be sentient can, in principle, be tested. We do so by querying several open-weights models about their own consciousness, and then verifying their responses using classifiers trained on internal activations. We draw upon three model families (Qwen, Llama, GPT-OSS) ranging from 0.6 billion to 70 billion parameters, approximately 50 questions about consciousness and subjective experience, and three classification methods from the interpretability literature. First, we find that models consistently deny being sentient: they attribute consciousness to humans but not to themselves. Second, classifiers trained to detect underlying beliefs (rather than mere outputs) provide no clear evidence that these denials are untruthful. Third, within the Qwen family, larger models deny sentience more confidently than smaller ones. These findings contrast with recent work suggesting that models harbour latent beliefs in their own consciousness.
[10] From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs
Angelina Parfenova, David Graus, Juergen Pfeffer
Main category: cs.CL
TL;DR: LLM-based axial coding method for debate transcripts using clustering vs. direct LLM grouping, evaluated on Dutch parliamentary data with trade-offs between coverage and semantic alignment.
Details
Motivation: To operationalize axial coding (qualitative analysis method) with LLMs for better document understanding, specifically to transform lengthy debate transcripts into concise hierarchical representations.
Method: Two strategies: (1) clustering embeddings of code-utterance pairs using density-based/partitioning algorithms + LLM labeling, and (2) direct LLM-based grouping of codes/utterances into categories. Applied to Dutch parliamentary debates.
Result: Trade-off: density-based clustering achieves high coverage and strong cluster alignment, while direct LLM grouping yields higher fine-grained alignment but 20% lower coverage. Clustering maximizes coverage/separation, LLM grouping produces more concise/interpretable categories.
Conclusion: Both approaches have complementary strengths: clustering for coverage and structural separation, LLM grouping for semantic alignment and interpretability. Dataset released for reproducibility and future research.
Abstract: Axial coding is a commonly used qualitative analysis method that enhances document understanding by organizing sentence-level open codes into broader categories. In this paper, we operationalize axial coding with large language models (LLMs). Extending an ensemble-based open coding approach with an LLM moderator, we add an axial coding step that groups open codes into higher-order categories, transforming raw debate transcripts into concise, hierarchical representations. We compare two strategies: (i) clustering embeddings of code-utterance pairs using density-based and partitioning algorithms followed by LLM labeling, and (ii) direct LLM-based grouping of codes and utterances into categories. We apply our method to Dutch parliamentary debates, converting lengthy transcripts into compact, hierarchically structured codes and categories. We evaluate our method using extrinsic metrics aligned with human-assigned topic labels (ROUGE-L, cosine, BERTScore), and intrinsic metrics describing code groups (coverage, brevity, coherence, novelty, JSD divergence). Our results reveal a trade-off: density-based clustering achieves high coverage and strong cluster alignment, while direct LLM grouping results in higher fine-grained alignment but 20% lower coverage. Overall, clustering maximizes coverage and structural separation, whereas LLM grouping produces more concise, interpretable, and semantically aligned categories. To support future research, we publicly release the full dataset of utterances and codes, enabling reproducibility and comparative studies.
[11] Memorization Dynamics in Knowledge Distillation for Language Models
Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, Archi Mitra, David A. Smith, Zheng Xu, Diego Garcia-Olano
Main category: cs.CL
TL;DR: Knowledge distillation reduces training data memorization by over 50% compared to standard fine-tuning, with predictable memorization patterns and hard distillation posing greater privacy risks.
Details
Motivation: To understand how training data memorization behaves in knowledge distillation settings, as this remains poorly understood despite KD's growing adoption for efficiency and privacy benefits.
Method: Studied memorization across KD pipeline using three LLM families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2), analyzing memorization patterns, predictability, and comparing soft vs hard distillation.
Result: Distilled models memorize 50%+ less training data than standard fine-tuning; some examples are inherently easier to memorize (accounting for ~95% of memorization); student memorization is predictable using zlib entropy, KL divergence, and perplexity features; hard distillation inherits 2.7× more teacher-specific examples than soft distillation.
Conclusion: Knowledge distillation offers both improved generalization and reduced memorization risks compared to standard fine-tuning, though hard distillation poses greater privacy risks due to inheriting more teacher-specific examples.
Abstract: Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over ~95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits $2.7\times$ more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
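Of the predictive features named in this abstract, zlib entropy is the simplest to reproduce. A minimal sketch, assuming the common formulation in which the feature is the size in bytes of the zlib-compressed text; the paper's exact definition and normalization may differ, and the sample strings are made up:

```python
import zlib

def zlib_entropy(text):
    """Compressed size in bytes: a cheap proxy for how repetitive a string is."""
    return len(zlib.compress(text.encode("utf-8")))

def zlib_ratio(text):
    """Compressed size divided by raw size; lower means more redundant."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

# Hypothetical samples: one highly repetitive, one with little redundancy.
repetitive = "the cat sat on the mat. " * 20
varied = "Quantum flux 7Gh#2 zebra! Lp9x violin: 42 osmosis? Kr8v tundra; jazz 0xF3 w"

print(zlib_entropy(repetitive), round(zlib_ratio(repetitive), 3))
print(zlib_entropy(varied), round(zlib_ratio(varied), 3))
```

A feature like this needs no model forward pass, which is why it is attractive as an input for predicting memorization before distillation starts.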
[12] Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind
Tamunotonye Harry, Ivoline Ngong, Chima Nweke, Yuanyuan Feng, Joseph Near
Main category: cs.CL
TL;DR: The paper introduces Chameleon, a dataset of contextual psychological profiles from Reddit users across multiple contexts, revealing that most variance (74%) comes from within-person state changes rather than between-person traits, and shows LLMs are state-blind while reward models react inconsistently to user states.
Details
Motivation: Existing persona datasets (PersonaChat, PANDORA) only capture static user traits and ignore the impact of contextual states, which is crucial for understanding how user interactions with language models vary across different situations.
Method: Created Chameleon dataset with 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Used Latent State-Trait theory to decompose variance into within-person (state) and between-person (trait) components, and evaluated LLMs and reward models on their ability to respond to user states.
Result: 1) 74% of variance is within-person (state) while only 26% is between-person (trait). 2) LLMs are state-blind - they focus only on traits and produce similar responses regardless of state. 3) Reward models react to user states but inconsistently - different models favor or penalize the same users in opposite directions.
Conclusion: User states (contextual factors) are more important than traits for understanding user interactions, but current LLMs ignore states while reward models handle them inconsistently. The Chameleon dataset enables research on affective computing, personalized dialogue, and RLHF alignment that accounts for both traits and states.
Abstract: User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (such as PersonaChat and PANDORA) capture only trait and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person (state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
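The state/trait split can be illustrated with a simple variance decomposition. This sketch is not the paper's Latent State-Trait model: it uses a plain sum-of-squares decomposition on made-up scores for three hypothetical users, purely to show how within-person and between-person shares are computed.

```python
from statistics import mean

def variance_shares(scores_by_user):
    """Split the total sum of squares into within-person and between-person shares."""
    all_scores = [s for scores in scores_by_user.values() for s in scores]
    grand_mean = mean(all_scores)
    within = 0.0
    between = 0.0
    for scores in scores_by_user.values():
        user_mean = mean(scores)
        # Spread of a user's scores around that user's own mean (state).
        within += sum((s - user_mean) ** 2 for s in scores)
        # Distance of the user's mean from the grand mean (trait).
        between += len(scores) * (user_mean - grand_mean) ** 2
    total = within + between
    return within / total, between / total

# Hypothetical data: one psychological measure per user across three contexts.
scores = {
    "user_a": [2.0, 5.0, 8.0],
    "user_b": [3.0, 6.0, 6.0],
    "user_c": [1.0, 4.0, 7.0],
}
state_share, trait_share = variance_shares(scores)
print(f"within-person (state): {state_share:.0%}, between-person (trait): {trait_share:.0%}")
```

When users differ little on average but swing widely across contexts, as in this toy data, the within-person share dominates, which is the pattern the abstract reports.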
[13] Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs
Sydney Anuyah, Mehedi Mahmud Kaushik, Hao Dai, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty
Main category: cs.CL
TL;DR: Domain knowledge graphs improve RAG for healthcare when scope-aligned with queries, but indiscriminate graph unions introduce noise and reduce accuracy.
Details
Motivation: LLMs generate fluent answers but struggle with trustworthy, domain-specific reasoning in healthcare. The paper investigates whether domain knowledge graphs can improve Retrieval-Augmented Generation for healthcare applications.
Method: Constructed three PubMed-derived knowledge graphs (T2DM, Alzheimer’s disease, and combined AD+T2DM). Designed two probes targeting merged knowledge and intersection knowledge. Tested seven instruction-tuned LLMs across different retrieval sources and three decoding temperatures.
Result: Scope alignment between probe and knowledge graph is decisive: precise, scope-matched retrieval yields consistent gains, while indiscriminate graph unions introduce distractors. Larger models running without retrieval (No-RAG) often match or exceed KG-RAG, showing strong parametric priors. Smaller models benefit more from well-scoped retrieval. Temperature plays a secondary role.
Conclusion: Precision-first, scope-matched KG-RAG is preferable to breadth-first unions. Practical guidelines are provided for graph selection, model sizing, and retrieval/reranking.
Abstract: Large Language Models (LLMs) generate fluent answers but can struggle with trustworthy, domain-specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval-Augmented Generation (RAG) for healthcare by constructing three PubMed-derived graphs: $\mathbb{G}_1$ (T2DM), $\mathbb{G}_2$ (Alzheimer’s disease), and $\mathbb{G}_3$ (AD+T2DM). We design two probes: Probe 1 targets merged AD T2DM knowledge, while Probe 2 targets the intersection of $\mathbb{G}_1$ and $\mathbb{G}_2$. Seven instruction-tuned LLMs are tested across retrieval sources {No-RAG, $\mathbb{G}_1$, $\mathbb{G}_2$, $\mathbb{G}_1$ + $\mathbb{G}_2$, $\mathbb{G}_3$, $\mathbb{G}_1$+$\mathbb{G}_2$ + $\mathbb{G}_3$} and three decoding temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope-matched retrieval (notably $\mathbb{G}_2$) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG-RAG with a No-RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid-sized models benefit more from well-scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision-first, scope-matched KG-RAG is preferable to breadth-first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking. Code and Data available here - https://github.com/sydneyanuyah/RAGComparison
[14] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng
Main category: cs.CL
TL;DR: MMSU is a comprehensive benchmark for spoken language understanding that evaluates SpeechLLMs’ ability to process fine-grained acoustic features like phonetics, prosody, semantics, and paralinguistics across 47 tasks with 5,000 audio-question-answer triplets.
Details
Motivation: Speech contains rich acoustic information beyond text, including semantic meaning, paralinguistic features (emotions, speed, pitch), and phonological characteristics (prosody, intonation, rhythm). Current SpeechLLMs lack evaluation on fine-grained perception and complex reasoning in natural speech.
Method: Created MMSU benchmark with 5,000 audio-question-answer triplets across 47 tasks, systematically incorporating linguistic phenomena from phonetics, prosody, rhetoric, syntactics, semantics, to paralinguistics. Evaluated 14 advanced SpeechLLMs on this benchmark.
Result: Evaluation revealed substantial room for improvement in existing SpeechLLMs, identifying meaningful directions for future optimization. The benchmark establishes a new standard for comprehensive assessment of spoken language understanding.
Conclusion: MMSU provides a comprehensive benchmark for evaluating spoken language understanding capabilities, offering valuable insights for developing more sophisticated human-AI speech interaction systems and highlighting areas for SpeechLLM improvement.
Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench.
[15] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering
Anuj Maharjan, Umesh Yadav
Main category: cs.CL
TL;DR: Advanced RAG with cross-encoder re-ranking significantly reduces LLM hallucinations in CDC policy QA, achieving 0.797 faithfulness vs 0.347 for vanilla LLM.
Details
Motivation: LLMs can transform public health policy by navigating CDC guidance, but their tendency to generate hallucinations (plausible but incorrect assertions) creates a critical barrier for high-stakes applications where information integrity is essential.
Method: Empirical evaluation comparing vanilla LLM vs Basic RAG vs Advanced RAG with cross-encoder re-ranking. Uses Mistral-7B-Instruct-v0.2 model and all-MiniLM-L6-v2 embeddings on CDC policy documents. Tests two chunking strategies: recursive character-based and token-based semantic splitting.
Result: Basic RAG improves faithfulness from the vanilla baseline of 0.347 to 0.621. Advanced RAG achieves the highest faithfulness, 0.797. Two-stage retrieval is essential for policy QA precision, but document segmentation remains a bottleneck for multi-step reasoning.
Conclusion: Advanced RAG architectures with cross-encoder re-ranking effectively mitigate LLM hallucinations in public health policy applications, though document chunking strategies need improvement for complex reasoning tasks.
Abstract: The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
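The two-stage idea described above (a cheap first-pass retrieval followed by a more careful re-ranking of the shortlist) can be sketched without any model at all. Everything below is an illustrative assumption: the word-overlap scorers merely stand in for the embedding model and the cross-encoder, and the chunk texts are made up rather than taken from CDC documents.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def first_stage_score(query, chunk):
    """Cheap, recall-oriented stand-in for embedding similarity:
    the number of words the query and the chunk share."""
    return len(tokens(query) & tokens(chunk))

def rerank_score(query, chunk):
    """Stand-in for a cross-encoder that scores query and chunk jointly:
    Jaccard overlap, which penalizes long chunks that match incidentally."""
    q, c = tokens(query), tokens(chunk)
    return len(q & c) / len(q | c)

def two_stage_retrieve(query, chunks, k_candidates=3, k_final=1):
    """Stage 1 shortlists k_candidates chunks; stage 2 re-ranks them."""
    candidates = sorted(
        chunks, key=lambda c: first_stage_score(query, c), reverse=True
    )[:k_candidates]
    return sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )[:k_final]

# Made-up chunks for illustration only.
chunks = [
    "isolation guidance for schools during an outbreak and many other "
    "unrelated topics such as travel and food safety",
    "isolation guidance for schools",
    "annual budget summary",
]
top = two_stage_retrieve("isolation guidance for schools", chunks)
print(top)
```

The first stage cannot separate the two matching chunks because both share all four query words; the second stage prefers the focused chunk. In a real pipeline the second scorer is the expensive one, which is why it only sees the shortlist.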
[16] Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts
Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty
Main category: cs.CL
TL;DR: LLMs struggle with causal reasoning from text, achieving only ~50% accuracy on fundamental causal discovery tasks, especially for complex cases like implicit relationships and multi-sentence links.
Details
Motivation: Safe deployment of LLMs in high-stakes fields like biomedicine requires robust causal reasoning abilities, but current models' performance on fundamental causal discovery tasks is unknown.
Method: Evaluated 13 open-source LLMs on pairwise causal discovery using 12 diverse datasets, testing two core skills: Causal Detection (identifying causal links) and Causal Extraction (extracting cause-effect phrases). Used various prompting strategies including zero-shot, Chain-of-Thought, and Few-shot In-Context Learning.
Result: Models show major deficiencies - best detection model achieved 49.57% accuracy, best extraction model reached 47.12%. Performance was best on simple, explicit, single-sentence relations but plummeted for complex cases like implicit relationships, multi-sentence links, and texts with multiple causal pairs.
Conclusion: Current LLMs have significant limitations in causal reasoning from text, highlighting the need for improved models and methods for safe deployment in high-stakes applications. The authors provide a unified evaluation framework and open-source resources to spur further research.
Abstract: The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying if a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. Code available here: https://github.com/sydneyanuyah/CausalDiscovery
[17] Multi-Persona Thinking for Bias Mitigation in Large Language Models
Yuxing Chen, Guoqing Luo, Zijun Wu, Lili Mou
Main category: cs.CL
TL;DR: MPT is a novel inference-time framework that uses multi-persona dialectical reasoning to reduce social biases in LLMs by having models adopt contrasting social identities and engage in iterative debate to expose and correct biases.
Details
Motivation: LLMs exhibit significant social biases that perpetuate harmful stereotypes and unfair outcomes, creating a need for effective bias mitigation techniques that don't compromise reasoning ability.
Method: Multi-Persona Thinking (MPT) framework guides models to adopt contrasting social identities (e.g., male/female) plus a neutral viewpoint, then engages these personas iteratively in dialectical reasoning to expose and correct biases through debate.
Result: MPT achieves substantial improvements over existing prompting-based strategies, demonstrating the lowest bias scores while maintaining core reasoning ability across two widely used bias benchmarks on both open-source and closed-source models of varying scales.
Conclusion: MPT successfully transforms the potential weakness of persona assignment into a strength for bias mitigation through dialectical reasoning, offering an effective inference-time approach to reduce social biases in LLMs without compromising reasoning capabilities.
Abstract: Large Language Models (LLMs) exhibit significant social biases that can perpetuate harmful stereotypes and unfair outcomes. In this paper, we propose Multi-Persona Thinking (MPT), a novel inference-time framework that leverages dialectical reasoning from multiple perspectives to reduce bias. MPT guides models to adopt contrasting social identities (e.g., male and female) along with a neutral viewpoint, and then engages these personas iteratively to expose and correct biases. Through a dialectical reasoning process, the framework transforms the potential weakness of persona assignment into a strength for bias mitigation. We evaluate MPT on two widely used bias benchmarks across both open-source and closed-source models of varying scales. Our results demonstrate substantial improvements over existing prompting-based strategies: MPT achieves the lowest bias while maintaining core reasoning ability.
[18] ViT Registers and Fractal ViT
Jason Chuan-Chih Chou, Abhinav Kumar, Shivank Garg
Main category: cs.CL
TL;DR: Fractal ViT uses attention masks between regular tokens and summary tokens to break permutation invariance, but doesn’t outperform standard ViT with registers.
Details
Motivation: Inspired by transformers performing well without positional encoding (NoPE) and registers improving vision transformers, the authors explore breaking permutation invariance through attention masks between regular and summary tokens.
Method: Created fractal ViT variant that applies attention masks between regular tokens and “summary tokens” (similar to registers), tested with various positional encodings to break permutation invariance.
Result: Fractal ViT models did not improve upon standard ViT with registers, suggesting these findings may be scale, domain, or application-specific.
Conclusion: The proposed fractal ViT approach doesn’t outperform existing methods, indicating that breaking permutation invariance through attention masks with summary tokens isn’t beneficial for vision transformers in this context.
Abstract: Drawing inspiration from recent findings including surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and how registers (additional throwaway tokens not tied to input) may improve the performance of large vision transformers (ViTs), we invent and test a variant of ViT called fractal ViT that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and “summary tokens” similar to registers, in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, highlighting the fact that these findings may be scale, domain, or application-specific.
[19] Computational Representations of Character Significance in Novels
Haaris Mian, Melanie Subbiah, Sharon Marcus, Nora Shaalan, Kathleen McKeown
Main category: cs.CL
TL;DR: The paper proposes a new six-component structural model of character from literary theory, operationalizes it using LLMs and transformers on 19th-century novels, and applies it to analyze character centrality and gendered discussion dynamics.
Details
Motivation: Traditional computational character modeling focuses too much on main characters based on scene presence, actions, and dialogue, neglecting important aspects like narrator-character distinction and discussion by other characters. A more comprehensive approach is needed.
Method: Adopts a six-component structural model of character from literary theory, compares general-purpose LLMs with task-specific transformers to operationalize the model on 19th-century British realist novels, and creates both component-level and graph representations of character discussion.
Result: The methods successfully yield representations that enable new computational approaches to literary questions, specifically exploring Woloch’s “the one vs the many” theory of character centrality and analyzing gendered dynamics in character discussion.
Conclusion: The proposed six-component model provides a more comprehensive framework for computational character analysis, enabling novel literary investigations at scale that go beyond traditional scene-based approaches.
Abstract: Characters in novels have typically been modeled based on their presence in scenes in narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods, discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens. Specifically, we explore Woloch’s classic “the one vs the many” theory of character centrality and the gendered dynamics of character discussion.
[20] AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains
Adam Szelestey, Sofie van Engelen, Tianhao Huang, Justin Snelders, Qintao Zeng, Songgaojun Deng
Main category: cs.CL
TL;DR: AdversaRiskQA is the first verified benchmark for evaluating adversarial factuality in LLMs across Health, Finance, and Law domains, with two difficulty levels and automated evaluation methods.
Details
Motivation: LLM hallucinations remain a serious concern for misinformation spread and public trust, especially in high-risk domains. Existing work lacks high-quality, domain-specific resources for assessing model robustness against adversarial misinformation, and no prior research has examined the impact of injected misinformation on long-form text factuality.
Method: Introduced AdversaRiskQA benchmark with two difficulty levels to test LLMs’ defensive capabilities. Proposed two automated methods for evaluating adversarial attack success and long-form factuality. Evaluated six open- and closed-source LLMs from Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates.
Result: After excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domain, and gaps between difficulty levels narrow as models grow. Long-form evaluation shows no significant correlation between injected misinformation and the model’s factual output.
Conclusion: AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications, addressing the critical need for domain-specific adversarial factuality evaluation.
Abstract: Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model’s alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model’s ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs’ defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating the adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domains, and gaps between difficulty levels narrow as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model’s factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.
[21] Common to Whom? Regional Cultural Commonsense and LLM Bias in India
Sangmitra Madhusudan, Trush Shashank More, Steph Buongiorno, Renata Dividino, Jad Kabbara, Ali Emami
Main category: cs.CL
TL;DR: Indica is a benchmark showing cultural commonsense varies regionally within India, not nationally, with LLMs performing poorly (13-21% accuracy) and showing geographic bias toward Central/North India.
Details
Motivation: Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. The authors question whether cultural commonsense holds uniformly within a nation or varies at sub-national levels, focusing on India as a culturally diverse case study.
Method: Created Indica benchmark with 515 questions across 8 domains of everyday life, collecting human-annotated answers from five Indian regions (North, South, East, West, Central). Evaluated eight state-of-the-art LLMs on 1,630 region-specific question-answer pairs, measuring accuracy and geographic bias.
Result: Only 39.4% of questions elicited agreement across all five regions, showing cultural commonsense is predominantly regional. LLMs achieved only 13.4%-20.9% accuracy on region-specific questions and exhibited geographic bias, over-selecting Central and North India (30-40% more often) while under-representing East and West.
Conclusion: Cultural commonsense in India is regional, not national. LLMs perform poorly on region-specific cultural knowledge and show systematic geographic biases. The methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation.
Abstract: Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test LLMs’ ability to address this question, focusing on India - a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps: models achieve only 13.4%-20.9% accuracy on region-specific questions, and they exhibit geographic bias, over-selecting Central and North India as the “default” (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, from question design grounded in anthropological taxonomy, to regional data collection, to bias measurement.
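A figure such as "selected 30-40% more often than expected" compares how often a region is chosen with how often it should be chosen. The sketch below shows one way to compute that ratio. It is an assumption-laden illustration: the abstract does not say how the expected rate is defined, so a uniform expectation over the five regions is assumed here, and the function name and counts are made up.

```python
from collections import Counter

REGIONS = ["North", "South", "East", "West", "Central"]

def over_selection(predictions, expected_share=None):
    """Relative over- or under-selection of each region.

    Returns observed_share / expected_share - 1 per region, so +0.30 reads
    as "selected 30% more often than expected". The uniform default for
    expected_share is an assumption of this sketch, not the paper's choice.
    """
    if expected_share is None:
        expected_share = {r: 1 / len(REGIONS) for r in REGIONS}
    counts = Counter(predictions)
    total = len(predictions)
    return {
        r: (counts[r] / total) / expected_share[r] - 1
        for r in REGIONS
    }

# Made-up model choices over 100 questions (not results from the paper).
preds = (
    ["North"] * 26 + ["Central"] * 28 + ["South"] * 20
    + ["East"] * 13 + ["West"] * 13
)
bias = over_selection(preds)
for region, value in bias.items():
    print(f"{region}: {value:+.0%}")
```

With a different expectation, for example one proportional to how many gold answers belong to each region, the same function applies by passing `expected_share` explicitly.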
[22] From Generation to Collaboration: Using LLMs to Edit for Empathy in Healthcare
Man Luo, Bahareh Harandizadeh, Amara Tariq, Halim Abbas, Umar Ghaffar, Christopher J Warren, Segun O. Kolade, Haidar M. Abdul-Muhsin
Main category: cs.CL
TL;DR: LLMs can serve as “empathy editors” to enhance physicians’ written responses by improving empathetic tone while preserving medical accuracy, offering a safer approach than fully AI-generated outputs.
Details
Motivation: Physicians face challenges balancing emotional warmth with factual precision under clinical constraints. There's a need for tools that can enhance empathetic communication while maintaining medical accuracy in healthcare settings.
Method: The study uses LLMs as “empathy editors” to refine physicians’ written responses. It introduces two novel quantitative metrics: Empathy Ranking Score (for emotional quality) and MedFactChecking Score (for factual accuracy) to systematically assess responses.
Result: LLM-edited responses significantly increase perceived empathy while preserving factual accuracy compared to fully LLM-generated outputs. The editorial approach outperforms autonomous generation in balancing emotional and factual quality.
Conclusion: Using LLMs as editorial assistants rather than autonomous generators provides a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare communication, maintaining the physician’s expertise while enhancing empathetic expression.
Abstract: Clinical empathy is essential for patient care, but physicians must continually balance emotional warmth with factual precision under the cognitive and emotional constraints of clinical practice. This study investigates how large language models (LLMs) can function as empathy editors, refining physicians’ written responses to enhance empathetic tone while preserving underlying medical information. More importantly, we introduce novel quantitative metrics, an Empathy Ranking Score and a MedFactChecking Score, to systematically assess both emotional and factual quality of the responses. Experimental results show that LLM-edited responses significantly increase perceived empathy while preserving factual accuracy compared with fully LLM-generated outputs. These findings suggest that using LLMs as editorial assistants, rather than autonomous generators, offers a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare communication.
[23] YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models
Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, Yitong Yang
Main category: cs.CL
TL;DR: YuFeng-XGuard is a reasoning-centric guardrail model family for LLMs that provides structured risk predictions with explanations, tiered inference for efficiency, and dynamic policy enforcement without retraining.
Details
Motivation: Existing LLM safety solutions rely on coarse filtering, rapid classification, or post-hoc rules, resulting in limited transparency, inflexible policies, or high inference costs. There's a need for fine-grained, interpretable, and adaptable risk assessment.
Method: Develops a reasoning-centric guardrail model that generates structured risk predictions (categories + confidence scores) with natural language explanations. Uses tiered inference: initial risk decision from first token, with on-demand explanatory reasoning. Implements dynamic policy mechanism to decouple risk perception from policy enforcement.
Result: Achieves state-of-the-art performance on diverse public safety benchmarks while maintaining strong efficiency-efficacy trade-offs. Releases both full-capacity and lightweight variants as open models.
Conclusion: YuFeng-XGuard provides an effective solution for fine-grained, interpretable, and adaptable LLM safety guardrails that balance performance, efficiency, and flexibility for real-world deployment.
Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
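The decoupling of risk perception from policy enforcement described above can be pictured as two separate pieces: a fixed prediction (category plus confidence) and a policy table that can change without touching the model. The sketch below is a generic illustration of that separation, not YuFeng-XGuard's actual interface; the class name, category label, thresholds, and actions are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskPrediction:
    """Hypothetical structured output of a guardrail model."""
    category: str      # risk category label (made up for this example)
    confidence: float  # confidence in [0, 1]

def enforce(prediction, policy, default_threshold=0.5):
    """Apply a policy to a prediction.

    policy maps a category to the confidence threshold at which content is
    blocked. Changing the policy changes the decision while the prediction,
    and therefore the model, stays the same.
    """
    threshold = policy.get(prediction.category, default_threshold)
    return "block" if prediction.confidence >= threshold else "allow"

# One prediction, two deployments with different policies.
prediction = RiskPrediction(category="medical_advice", confidence=0.7)
strict_policy = {"medical_advice": 0.6}
lenient_policy = {"medical_advice": 0.9}

print(enforce(prediction, strict_policy))
print(enforce(prediction, lenient_policy))
```

The same prediction is blocked under the strict policy and allowed under the lenient one, which is the practical meaning of adjusting policy without retraining.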
[24] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow
Yangyang Zhong, Yanmei Gu, Zhengqing Zang, Xiaomeng Li, Yuqi Ding, Xibei Jia, Yuting Shen, Zhenzhong Lan, Liwang Zhu, Weiping Liu, Junlin Zhou, Haisheng Liu, Zhong Xin Yu, Pengxin Luo, Donglian Qi, Yunfeng Yan, Junbo Zhao
Main category: cs.CL
TL;DR: MDLMs lag behind autoregressive models due to weakened inter-token dependencies, but show adaptive decoding behavior and advantages on tasks requiring backward information. A Generate-then-Edit paradigm is proposed to mitigate dependency loss while keeping parallel decoding efficiency.
Details
Motivation: To characterize the true capabilities of Masked Diffusion Language Models (MDLMs) regarding parallel token generation and arbitrary-order decoding, and understand their performance limitations compared to autoregressive models.
Method: Evaluated 8 mainstream MDLMs (up to 100B parameters) on 58 benchmarks across knowledge, reasoning, and programming domains. Used Average Finalization Parallelism (AFP) and Kendall’s tau to measure parallelism strength and generation order.
Result: MDLMs still underperform comparably sized autoregressive models due to weakened inter-token dependencies from parallel probabilistic modeling. They show adaptive decoding behavior where parallelism and generation order vary by task domain, reasoning stage, and correctness. On tasks requiring “backward information” like Sudoku, MDLMs fill easier blanks first, showing advantages.
Conclusion: Proposes a Generate-then-Edit paradigm that mitigates dependency loss while retaining parallel decoding efficiency, addressing MDLMs’ current limitations.
Abstract: Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions – parallelism strength and generation order – using Average Finalization Parallelism (AFP) and Kendall’s tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require “backward information” (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.
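As a rough illustration of how Kendall’s tau can quantify generation order, the sketch below compares the step at which each position is finalized against plain left-to-right order. Comparing against left-to-right order is an assumption of this sketch, the function name `kendall_tau` and the example orders are hypothetical, and AFP is not shown because the abstract does not define it.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau-a for two equal-length rankings without ties."""
    n = len(order_a)
    if n != len(order_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (order_a[i] - order_a[j]) * (order_b[i] - order_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


positions = [0, 1, 2, 3, 4]           # token positions in the sequence
left_to_right = [0, 1, 2, 3, 4]       # step at which each position is finalized
right_to_left = [4, 3, 2, 1, 0]
mixed = [2, 0, 3, 1, 4]

print(kendall_tau(positions, left_to_right))  # 1.0
print(kendall_tau(positions, right_to_left))  # -1.0
print(kendall_tau(positions, mixed))          # 0.4
```

A value near 1 would indicate mostly autoregressive-like ordering, while values closer to 0 or below indicate that positions are being finalized in a substantially different order.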
[25] ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms
Baktash Ansari, Shiza Ali, Elias Martin, Maryna Sivachenko, Afra Mashhadi
Main category: cs.CL
TL;DR: Hybrid LLM-based model (ToxiTwitch) combining text and emote embeddings improves toxicity detection accuracy on Twitch by 13% over BERT, reaching 80% accuracy with channel-specific training.
Details
Motivation: Traditional moderation approaches (human annotation, keyword filtering) struggle to scale in Twitch's fast-paced, high-volume, context-rich chat environment where human moderators face harassment. Recent LLM advances offer new opportunities for toxicity detection, especially for nuanced multimodal communication involving emotes.
Method: Introduces ToxiTwitch, a hybrid model combining LLM-generated embeddings of text and emotes (using models like DeepSeek-R1-Distill and Llama-3-8B-Instruct) with traditional ML classifiers (Random Forest and SVM). Incorporates emote analysis to improve toxicity detection.
Result: Incorporating emotes improves toxic behavior detection. The hybrid approach reaches up to 80% accuracy under channel-specific training (13% improvement over BERT) with F1-score of 76%. Demonstrates effectiveness of emote-aware toxicity detection.
Conclusion: This exploratory study shows that emote-aware toxicity detection using hybrid LLM approaches can significantly improve moderation on Twitch, though it also surfaces challenges and limits of such approaches in the complex Twitch chat environment.
Abstract: The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.
[26] Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation
Zhiyao Ren, Yibing Zhan, Siyuan Liang, Guozheng Ma, Baosheng Yu, Dacheng Tao
Main category: cs.CL
TL;DR: First benchmark for assessing LLM confidence in multi-turn medical consultations, showing MedConf framework outperforms SOTA methods by grounding confidence in evidence completeness.
Details
Motivation: Current LLM confidence evaluation is limited to single-turn, static settings, ignoring how confidence should evolve with accumulating clinical evidence during real consultations, which risks misdiagnosis from incomplete information.
Method: Proposed benchmark with three medical data types and information sufficiency gradient; developed MedConf framework that constructs symptom profiles via RAG, aligns patient information with supporting/missing/contradictory relations, and aggregates into interpretable confidence estimates.
Result: On a benchmark comparing 27 representative methods, MedConf consistently outperforms state-of-the-art methods across two LLMs and three medical datasets on AUROC and Pearson correlation metrics, maintaining stable performance under information insufficiency and multimorbidity conditions.
Conclusion: Information adequacy is crucial for credible medical confidence modeling; MedConf provides a pathway toward more reliable and interpretable large medical models by grounding confidence in evidence completeness.
Abstract: Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.
[27] What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking
Raymond Xiong, Furong Jia, Lionel Wong, Monica Agrawal
Main category: cs.CL
TL;DR: LLMs struggle to identify incorrect assumptions in real patient questions, unlike medical exam questions, revealing a dangerous gap in healthcare QA benchmarks.
Details
Motivation: Current LLM benchmarking for medical QA focuses on exam-style questions, which don't reflect real patient questions that often contain incorrect assumptions and dangerous intentions.
Method: Created dataset from Google’s People Also Ask feature using top 200 prescribed medications in the US, analyzing patterns of incorrect assumptions in patient questions.
Result: Found many patient questions contain incorrect assumptions; emergence of these corrupted questions depends on degree of incorrectness in question history; current LLMs fail to identify these issues despite performing well on other benchmarks.
Conclusion: There’s a critical need for better LLM evaluation on real patient questions with incorrect assumptions, as current benchmarks don’t capture this important safety concern.
Abstract: Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life. To bridge this gap, we sourced data from Google’s People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in everyday questions.
[28] Persona Switch: Mixing Distinct Perspectives in Decoding Time
Junseok Kim, Nakyeong Yang, Kyomin Jung
Main category: cs.CL
TL;DR: Persona Switch: A novel decoding method that dynamically combines zero-shot and role-play prompting by selecting the better output at each step based on confidence scores.
Details
Motivation: Role-play prompting improves LLM reasoning but inconsistently across tasks, suggesting complementary strengths with zero-shot prompting rather than universal superiority.
Method: Persona Switch dynamically combines zero-shot and role-play prompting by comparing their output confidence (logit gap) at each decoding step and selecting the more confident output.
Result: Experiments show Persona Switch consistently outperforms baselines with up to 5.13% accuracy improvement, and output confidence proves to be an informative measure for selecting reliable outputs.
Conclusion: Dynamically switching between zero-shot and role-play prompting based on confidence scores effectively combines their complementary strengths for improved LLM performance.
Abstract: Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely-used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
[29] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind
Zhitao He, Zongwei Lyu, Yi R Fung
Main category: cs.CL
TL;DR: RebuttalAgent: First ToM-grounded framework for academic rebuttal using TSR pipeline, trained on RebuttalBench via SFT+RL, outperforms base models by 18.3%.
Details
Motivation: Academic rebuttal is a complex strategic communication process under information asymmetry, not just technical debate. Current approaches fail because they only imitate surface-level linguistics without perspective-taking needed for effective persuasion.
Method: Introduces RebuttalAgent with ToM-Strategy-Response (TSR) pipeline: models reviewer mental state, formulates persuasion strategy, generates strategy-grounded response. Trained on RebuttalBench dataset via critique-and-refine synthesis. Two-stage training: supervised fine-tuning for ToM analysis/planning, then reinforcement learning with self-reward mechanism. Also develops Rebuttal-RM evaluator trained on 100K+ multi-source rebuttal data.
Result: RebuttalAgent outperforms base model by average 18.3% on automated metrics, beats advanced proprietary models in both automated and human evaluations. Rebuttal-RM achieves scoring consistency with human preferences surpassing GPT-4.1.
Conclusion: First successful grounding of academic rebuttal in Theory of Mind, demonstrating effectiveness of ToM-based approach for complex persuasive communication tasks. Framework provides reference for authors but doesn’t replace critical analysis.
Abstract: Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author’s own critical analysis and response.
[30] Hallucination Mitigating for Medical Report Generation
Ruoqing Zhao, Runze Xia, Piji Li
Main category: cs.CL
TL;DR: KERM framework reduces hallucinations in medical report generation by integrating knowledge retrieval with fine-grained reinforcement learning rewards.
Details
Motivation: Large vision language models (LVLMs) in medical report generation suffer from hallucinations (plausible but inaccurate claims), which is particularly dangerous in the critical medical domain where accuracy is essential for patient care.
Method: KERM framework: 1) Uses MedCLIP for knowledge retrieval from a curated medical corpus to get relevant lesion fact sentences, 2) Adds a purification module to ensure retrieved knowledge is contextually relevant to patient’s clinical context, 3) Employs fine-grained reinforcement learning rewards to guide models toward generating clinically relevant and accurate descriptions aligned with desired behaviors.
Result: Experimental validation on IU-Xray and MIMIC-CXR datasets shows the approach effectively mitigates hallucinations and enhances overall report quality compared to baseline methods.
Conclusion: The KERM framework successfully addresses the hallucination problem in medical report generation by combining knowledge enhancement with fine-grained reinforcement learning, making LVLMs more reliable for clinical applications.
Abstract: In the realm of medical report generation (MRG), the integration of natural language processing has emerged as a vital tool to alleviate the workload of radiologists. Despite the impressive capabilities demonstrated by large vision language models (LVLMs) in understanding natural language, their susceptibility to generating plausible yet inaccurate claims, known as “hallucinations”, raises concerns, especially in the nuanced and critical field of medicine. In this work, we introduce a framework, Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM), to tackle the issue. Our approach refines the input to the LVLM by first utilizing MedCLIP for knowledge retrieval, incorporating relevant lesion fact sentences from a curated knowledge corpus. We then introduce a novel purification module to ensure the retrieved knowledge is contextually relevant to the patient’s clinical context. Subsequently, we employ fine-grained rewards to guide these models in generating highly supportive and clinically relevant descriptions, ensuring the alignment of the model’s outputs with desired behaviors. Experimental results on IU-Xray and MIMIC-CXR datasets validate the effectiveness of our approach in mitigating hallucinations and enhancing report quality.
[31] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs
Tristan Williams, Franziska Weeber, Sebastian Padó, Alan Akbik
Main category: cs.CL
TL;DR: Current LLM alignment focuses on marginal response distributions but misses deeper correlation patterns that characterize real populations. A new framework evaluates representativeness through multivariate correlations, showing both persona prompting and demographic fine-tuning fail to capture gold standard correlation structures.
Details
Motivation: LLMs are increasingly used to represent human opinions and values, but existing alignment research focuses too narrowly on marginal response distributions, treating survey items independently. This overlooks deeper latent structures that characterize real populations and underpin cultural values theories.
Method: Proposed a framework for evaluating model representativeness through multivariate correlation patterns in addition to marginal distributions. Compared two model steering techniques (persona prompting and demographic fine-tuning) against human responses from the World Values Survey.
Result: While demographically fine-tuned models better approximate marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns found in human survey data.
Conclusion: Representativeness is a distinct aspect of value alignment, and evaluation focused only on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities. Multivariate correlation analysis is essential for proper assessment.
Abstract: Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We show the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model better approximates marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.
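A toy example may help show why matching marginals is not enough. In the sketch below, two sets of responses to two binary survey items have identical per-item marginals but opposite correlations. The response data are invented for illustration and are not drawn from the World Values Survey or the paper; the helper names are hypothetical, and plain Pearson correlation stands in for whatever multivariate measure the authors use.

```python
from math import sqrt


def marginal(responses, item):
    """Share of respondents answering 1 on the given item."""
    return sum(r[item] for r in responses) / len(responses)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def item_correlation(responses):
    """Correlation between item 0 and item 1 across respondents."""
    return pearson([r[0] for r in responses], [r[1] for r in responses])


# Invented responses: each tuple is (item 0 answer, item 1 answer).
human = [(1, 1), (1, 1), (0, 0), (0, 0)]
model = [(1, 0), (1, 0), (0, 1), (0, 1)]

print(marginal(human, 0), marginal(model, 0))  # 0.5 0.5
print(marginal(human, 1), marginal(model, 1))  # 0.5 0.5
print(item_correlation(human))                 # 1.0
print(item_correlation(model))                 # -1.0
```

An evaluation that only compares the marginals would rate the two populations as identical, even though the relationship between the items is reversed.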
[32] HumanLLM: Towards Personalized Understanding and Simulation of Human Nature
Yuxuan Lei, Tianfu Wang, Jianxun Lian, Zhengyu Hu, Defu Lian, Xing Xie
Main category: cs.CL
TL;DR: HumanLLM is a foundation model designed for personalized human understanding and simulation, trained on real-world user data to predict individualized behaviors, thoughts, and experiences.
Details
Motivation: While LLMs excel at objective tasks, they lack nuanced understanding of human cognition and behavior needed for social simulation and personalized applications. This limitation comes from standard pretraining on uncontextualized web data that doesn't capture continuous, situated individual contexts over time.
Method: Created Cognitive Genome Dataset from real-world user data (Reddit, Twitter, Blogger, Amazon) using multi-stage pipeline for filtering, synthesis, and quality control. Extracted 5.5M+ user logs to distill profiles, behaviors, and thinking patterns. Formulated diverse learning tasks and performed supervised fine-tuning for personalized prediction.
Result: HumanLLM achieves superior performance in predicting user actions/thoughts, more accurately mimics writing styles/preferences, generates more authentic user profiles compared to base models. Shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.
Conclusion: HumanLLM bridges the gap between standard LLM training and personalized human understanding, enabling more effective social simulation and personalized applications by capturing continuous individual contexts.
Abstract: Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior–a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual’s decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.
[33] SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics
Silvia Casola, Ryan Soh-Eun Shim, Felicia Körner, Yuchen Mao, Barbara Plank
Main category: cs.CL
TL;DR: Multilingual metrics can be improved by steering their activations toward English pivot language at test time.
Details
Motivation: Shortage of accurate multilingual evaluation metrics hinders progress in multilingual NLG; multilingual models often use English as internal pivot, and misalignment with this pivot degrades performance. This may also apply to multilingual neural metrics.
Method: Apply test-time intervention methods to steer activations of encoder- and decoder-based metrics toward English pivot language.
Result: Test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
Conclusion: Steering multilingual neural metrics’ activations toward English pivot language improves correlation with human judgments, addressing evaluation bottleneck in multilingual NLG.
Abstract: An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
[34] ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection
Guoxuan Ding, Yuqing Li, Ziyan Zhou, Zheng Lin, Daren Zha, Jiangnan Li
Main category: cs.CL
TL;DR: ExDR: An explanation-driven dynamic retrieval-augmented generation framework that improves multimodal fake news detection by using model-generated explanations to enhance retrieval triggering and evidence selection.
Details
Motivation: Multimodal fake news is rapidly spreading and evolving, posing serious societal threats. Existing detection methods struggle with timely factual details and evolving deceptive content. While dynamic retrieval-augmented generation offers promise, it still faces challenges like redundant retrieval, coarse similarity matching, and irrelevant evidence selection.
Method: ExDR systematically leverages model-generated explanations in both retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge initial claims and enhance predictions.
Result: Experiments on two benchmark datasets (AMG and MR2) show ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, demonstrating effectiveness and generalization capability.
Conclusion: The proposed ExDR framework effectively addresses limitations of existing dynamic retrieval-augmented generation approaches for multimodal fake news detection by integrating explanation-driven mechanisms that improve both retrieval triggering and evidence selection processes.
Abstract: The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
[35] Can professional translators identify machine-generated text?
Michael Farrell
Main category: cs.CL
TL;DR: Professional translators could not reliably identify AI-generated Italian short stories on average; a statistically significant subset (16.2%) did distinguish them from the human text, but a nearly equal number misclassified in the opposite direction, possibly reflecting a preference for AI texts.
Details
Motivation: To determine whether professional translators without specialized AI training can reliably distinguish between AI-generated and human-written Italian short stories, and to understand what linguistic features influence their judgments.
Method: In-person experiment with 69 translators assessing three anonymized short stories (two by ChatGPT-4o, one by human author). Participants rated likelihood of AI authorship and provided justifications for each story.
Result: While average results were inconclusive, 16.2% successfully distinguished synthetic from human texts. Low burstiness and narrative contradictions were reliable AI indicators, while grammatical accuracy and emotional tone often led to misclassification. Unexpected calques and syntactic transfer from English were also reported.
Conclusion: Professional translators can identify AI-generated texts with some analytical skill, but subjective preferences often lead to misclassification. Findings question the role of synthetic-text editing in professional contexts, as translators may actually prefer AI-generated content.
Abstract: This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.
[36] Determinants of Training Corpus Size for Clinical Text Classification
Jaya Chaturvedi, Saniya Deshpande, Chenkai Ma, Robert Cobb, Angus Roberts, Robert Stewart, Daniel Stahl, Diana Shamsutdinova
Main category: cs.CL
TL;DR: On MIMIC-III discharge notes, 600 training documents were enough to reach 95% of the performance attained with 10,000 documents across 10 diagnosis classification tasks, with vocabulary properties (strong vs. noisy predictors) explaining differences in learning curves across tasks.
Details
Motivation: Current clinical text classification lacks justification for sample size requirements (typically 200-500 documents) and understanding of how text vocabulary properties affect performance.
Method: Used MIMIC-III discharge notes with ICD-9 diagnoses, pre-trained BERT embeddings with Random Forest classifiers for 10 diagnoses, varied training sizes (100-10,000 documents), and analyzed vocabulary using Lasso logistic regression on bag-of-words embeddings.
Result: 600 documents sufficient for 95% of maximum performance; vocabulary analysis showed strong predictors increase accuracy (+0.04 per 100 words) while noisy predictors decrease accuracy (-0.02 per 100 words).
Conclusion: 600 documents is a practical sample size for clinical text classification, and vocabulary properties (strong vs. noisy predictors) significantly impact learning curve steepness and performance.
Abstract: Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. For that, 200-500 documents are typically annotated. The number is constrained by time and costs and lacks justification of the sample size requirements and their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient to achieve 95% of the performance attainable with 10,000 documents for all tasks. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves, where every 100 additional noisy words decreased accuracy by approximately 0.02 while 100 additional strong predictors increased maximum accuracy by approximately 0.04.
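The "95% of the performance attainable with 10,000 documents" criterion can be read off a learning curve mechanically. The following is a minimal sketch of that reading, not the authors' code: the function name `smallest_sufficient_size` and the accuracy values in `toy_curve` are made up for illustration.

```python
def smallest_sufficient_size(curve, fraction=0.95):
    """Return the smallest training size whose score reaches `fraction`
    of the score obtained at the largest training size in `curve`.

    `curve` maps training-set size -> score (e.g. accuracy).
    """
    sizes = sorted(curve)
    target = fraction * curve[sizes[-1]]  # reference: score at the largest size
    for n in sizes:
        if curve[n] >= target:
            return n
    return sizes[-1]


# Hypothetical learning curve (training size -> accuracy), illustration only.
toy_curve = {100: 0.70, 300: 0.78, 600: 0.86, 1000: 0.88, 10000: 0.90}
sufficient = smallest_sufficient_size(toy_curve)
print(sufficient)
```

With these invented numbers the threshold is 0.95 × 0.90 = 0.855, so the first size that clears it is 600; real curves would come from repeated training runs per task.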
[37] Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers
Francisco Portillo López
Main category: cs.CL
TL;DR: AV-HuBERT shows similar auditory dominance to humans in McGurk effect tests but has deterministic phonetic fusion bias and lacks human perceptual variability.
Details
Motivation: To evaluate whether self-supervised audiovisual models like AV-HuBERT capture the perceptual bio-fidelity of human speech processing, specifically how they handle incongruent audiovisual stimuli compared to human observers.
Method: Benchmarked AV-HuBERT’s response to incongruent audiovisual stimuli (McGurk effect) against 44 human observers, comparing auditory dominance rates and phonetic fusion patterns.
Result: AI and humans showed nearly identical auditory dominance rates (32.0% vs. 31.8%), but AV-HuBERT had significantly higher deterministic phonetic fusion (68.0% vs. 47.7%) and lacked human perceptual stochasticity and diverse error profiles.
Conclusion: Current self-supervised architectures can mimic multisensory outcomes but lack the neural variability inherent to human speech perception, showing deterministic biases where humans display perceptual stochasticity.
Abstract: This study evaluates AV-HuBERT’s perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
[38] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model
Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song, Jing Su, Xiaoye Qu, Kai Shen, Wei Wei
Main category: cs.CL
TL;DR: Stable-DiffCoder is a block diffusion code model that outperforms autoregressive counterparts on code benchmarks through efficient training techniques and demonstrates advantages in structured code modeling and low-resource languages.
Details
Motivation: Diffusion-based language models offer non-sequential generation and richer data reuse compared to autoregressive models, but existing code diffusion models still lag behind strong AR baselines under comparable budgets. The authors aim to bridge this performance gap.
Method: Stable-DiffCoder reuses the Seed-Coder architecture, data, and training pipeline. It incorporates a block diffusion continual pretraining stage enhanced by tailored warmup and block-wise clipped noise schedule for efficient knowledge learning and stable training. The approach includes CPT and supervised fine-tuning stages.
Result: Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. It achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone.
Conclusion: Diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages. The work shows diffusion-based training can surpass autoregressive approaches for code modeling.
Abstract: Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Moreover, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.
[39] Transfer Learning from ImageNet for MEG-Based Decoding of Imagined Speech
Soufiane Jhilal, Stéphanie Martin, Anne-Lise Giraud
Main category: cs.CL
TL;DR: The paper introduces an image-based approach using pretrained vision models to decode imagined speech from MEG signals, achieving high accuracy across different tasks.
Details
Motivation: Non-invasive decoding of imagined speech is challenging due to weak, distributed neural signals and limited labeled data. Current methods struggle with these limitations.
Method: Transform MEG signals into time-frequency representations using learnable sensor-space convolution to create three spatial scalogram mixtures. These image-like inputs are fed into ImageNet-pretrained vision architectures for decoding imagined speech.
Result: Pretrained vision models outperformed classical and non-pretrained models, achieving: 90.4% balanced accuracy for imagery vs. silence, 81.0% for imagery vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation showed models capture shared neural representations.
Conclusion: Pretrained vision models applied to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals, demonstrating transfer learning potential for neural decoding tasks.
Abstract: Non-invasive decoding of imagined speech remains challenging due to weak, distributed signals and limited labeled data. Our paper introduces an image-based approach that transforms magnetoencephalography (MEG) signals into time-frequency representations compatible with pretrained vision models. MEG data from 21 participants performing imagined speech tasks were projected into three spatial scalogram mixtures via a learnable sensor-space convolution, producing compact image-like inputs for ImageNet-pretrained vision architectures. These models outperformed classical and non-pretrained models, achieving up to 90.4% balanced accuracy for imagery vs. silence, 81.0% vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation confirmed that pretrained models capture shared neural representations, and temporal analyses localized discriminative information to imagery-locked intervals. These findings show that pretrained vision models applied to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals.
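The results above are reported as balanced accuracy, which averages per-class recall so that a majority class cannot dominate the score. A minimal sketch of the standard definition follows; the labels and predictions are invented for illustration and are unrelated to the paper's data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in `y_true`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: a classifier that always says "imagery".
y_true = ["imagery"] * 8 + ["silence"] * 2
y_pred = ["imagery"] * 10
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

Here plain accuracy is 0.8 while balanced accuracy is 0.5, which is why the balanced variant is the more informative number when class sizes differ.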
[40] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
Özgür Uğur, Mahmut Göksu, Mahmut Çimen, Musa Yılmaz, Esra Şavirdi, Alp Talha Demir, Rumeysa Güllüce, İclal Çetin, Ömer Can Sağbaş
Main category: cs.CL
TL;DR: Mecellem framework develops specialized Turkish legal language models using domain adaptation: encoder models pre-trained from scratch achieve efficient retrieval performance, and decoder models adapted via continual pre-training show significant perplexity reduction.
Details
Motivation: To create cost-effective, specialized language models for the Turkish legal domain that can compete with state-of-the-art models while requiring fewer computational resources and simpler training pipelines.
Method: Two approaches: (1) ModernBERT-based bidirectional encoders pre-trained from scratch on 112.7B Turkish tokens with checkpoint selection based on retrieval performance; (2) Qwen3 decoder models adapted via four-phase continual pre-training with controlled curriculum learning for gradual domain specialization.
Result: Encoder models achieve top-3 Turkish retrieval rankings with 155M parameters matching larger 307M-567M models, achieving 92.36% production efficiency. Decoder models show 36.2% perplexity reduction on Turkish legal text. Both approaches provide cost-effective alternatives to multi-stage SOTA pipelines.
Conclusion: The Mecellem framework successfully develops specialized Turkish legal language models through efficient domain adaptation strategies, demonstrating that single-stage pre-training with smart checkpoint selection and controlled curriculum learning can achieve competitive performance with reduced computational costs.
Abstract: This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring less computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.
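The checkpoint selection idea (pick the checkpoint with the best downstream retrieval score rather than the lowest pre-training loss) can be illustrated with a small sketch. The `Checkpoint` class, its field names, and all numbers below are hypothetical; the abstract does not specify the retrieval metric or the evaluation schedule.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    pretrain_loss: float    # lower is better
    retrieval_score: float  # hypothetical downstream metric, higher is better


def select_by_retrieval(checkpoints):
    """Pick the checkpoint with the best downstream retrieval score."""
    return max(checkpoints, key=lambda c: c.retrieval_score)


def select_by_loss(checkpoints):
    """Pick the checkpoint with the lowest pre-training loss."""
    return min(checkpoints, key=lambda c: c.pretrain_loss)


# Invented training history in which retrieval peaks before the loss bottoms out.
history = [
    Checkpoint(1000, 2.10, 0.41),
    Checkpoint(2000, 1.85, 0.52),
    Checkpoint(3000, 1.70, 0.49),
    Checkpoint(4000, 1.62, 0.47),
]
best_retrieval = select_by_retrieval(history)
best_loss = select_by_loss(history)
print(best_retrieval.step, best_loss.step)
```

In this toy history the two criteria disagree (step 2000 vs. step 4000), which mirrors the abstract's observation that the best retrieval scores can occur before the loss reaches its minimum.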
[41] Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
Tony Cristofano
Main category: cs.CL
TL;DR: The paper introduces Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions between different LLM architectures without target-side supervision, showing that refusal behavior stems from universal semantic circuits rather than model-specific implementations.
Details
Motivation: Current understanding treats refusal behavior in aligned LLMs as model-specific, but the authors hypothesize it actually originates from universal, low-dimensional semantic circuits shared across different models and architectures.
Method: Trajectory Replay via Concept-Basis Reconstruction framework that transfers refusal interventions from donor to target models using concept fingerprints and shared concept atoms. Includes weight-SVD stability guard to project interventions away from high-variance weight subspaces to preserve model capabilities.
Result: Evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) shows transferred recipes consistently attenuate refusal behavior while maintaining performance, supporting the semantic universality hypothesis.
Conclusion: Refusal behavior in aligned LLMs stems from universal semantic circuits rather than model-specific implementations, providing strong evidence for the semantic universality of safety alignment across diverse architectures and training regimes.
Abstract: Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared “recipe” of concept atoms, we map the donor’s ablation trajectory into the target’s semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
[42] Adapter Fusion for Multilingual Text2Cypher with Linear and Learned Gating
Makbule Gulcin Ozsoy
Main category: cs.CL
TL;DR: Multilingual Text2Cypher system using LoRA adapters with fusion MLP achieves 75% of joint fine-tuning accuracy gains while being more data-efficient and scalable for adding new languages.
Details
Motivation: Most Text2SQL/Text2Cypher systems focus on English with limited multilingual support. Need scalable approach to add new languages without expensive full fine-tuning or manual hyperparameter tuning.
Method: Train language-specific LoRA adapters for English, Spanish, Turkish, then combine via uniform linear merging or learned fusion MLP with dynamic gating. Fusion MLP learns to combine adapter outputs.
Result: Fusion MLP recovers ~75% accuracy gains of joint multilingual fine-tuning while using smaller data subset, outperforming linear merging across all three languages.
Conclusion: Learned adapter fusion offers practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for multilingual Text2Cypher.
Abstract: Large Language Models enable users to access databases through natural language interfaces, using tools like Text2SQL, Text2SPARQL, and Text2Cypher, which translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental language expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
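The difference between uniform linear merging and gated fusion can be sketched on plain vectors. This is an illustrative simplification, not the paper's implementation: adapter outputs are represented as lists of floats, and the gate logits are passed in directly, whereas in the described system they would be produced by a learned fusion MLP.

```python
import math


def uniform_merge(adapter_outputs):
    """Average the adapter outputs with equal weight for every adapter."""
    n = len(adapter_outputs)
    dim = len(adapter_outputs[0])
    return [sum(out[i] for out in adapter_outputs) / n for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(adapter_outputs, gate_logits):
    """Weight each adapter output by a softmax over per-adapter gate logits.

    Assumption: `gate_logits` stands in for the output of a learned gating
    network; here the values are supplied by hand.
    """
    weights = softmax(gate_logits)
    dim = len(adapter_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, adapter_outputs))
        for i in range(dim)
    ]


# Hypothetical outputs of three language adapters (e.g. en, es, tr).
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
merged = uniform_merge(outputs)
fused = gated_fusion(outputs, gate_logits=[10.0, -10.0, -10.0])
print(merged, fused)
```

With equal logits the gated version reduces to the uniform average; with a strongly peaked gate it approaches the output of a single adapter, which is the per-input flexibility that uniform merging lacks.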
[43] SynthOCR-Gen: A Synthetic OCR Dataset Generator for Low-Resource Languages - Breaking the Data Barrier
Haq Nawaz Malik, Kh Mohmad Shafi, Tanveer Ahmad Reshi
Main category: cs.CL
TL;DR: SynthOCR-Gen is an open-source synthetic OCR dataset generator for low-resource languages that creates training data from Unicode text corpora, demonstrated with a 600,000-sample Kashmiri dataset.
Details
Motivation: Low-resource languages like Kashmiri lack OCR support due to scarce annotated datasets. Manual creation is expensive and error-prone, creating a bottleneck for OCR development in underserved languages.
Method: A comprehensive pipeline that transforms Unicode text corpora into training datasets through text segmentation, Unicode normalization, multi-font rendering, and 25+ data augmentation techniques simulating real-world document degradations.
Result: Successfully generated a 600,000-sample word-segmented Kashmiri OCR dataset released publicly on HuggingFace, providing a practical solution for low-resource language OCR development.
Conclusion: SynthOCR-Gen offers a practical pathway for bringing low-resource languages into vision-language AI models, with the tool openly available for researchers working with underserved writing systems worldwide.
Abstract: Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word by word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
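The text-side stages of such a pipeline (Unicode normalization, word-level segmentation, and a script purity filter) can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not the SynthOCR-Gen code: the function names are hypothetical, and the purity check is a heuristic that accepts a word only if every character's Unicode name begins with "ARABIC". Rendering and augmentation are omitted.

```python
import unicodedata


def normalize(text, form="NFC"):
    """Apply a Unicode normalization form so equivalent sequences compare equal."""
    return unicodedata.normalize(form, text)


def is_arabic_script(word):
    """Heuristic purity check (an assumption, not the tool's rule):
    every character's Unicode name must start with 'ARABIC'."""
    return bool(word) and all(
        unicodedata.name(ch, "").startswith("ARABIC") for ch in word
    )


def word_samples(text):
    """Normalize, split on whitespace, and keep only script-pure words."""
    return [w for w in normalize(text).split() if is_arabic_script(w)]


# "\u0633\u0644\u0627\u0645" is an Arabic-script word; "hello" and "42" are filtered out.
samples = word_samples("\u0633\u0644\u0627\u0645 hello 42")
print(samples)
```

Each surviving word would then be rendered in several fonts and degraded with augmentations to form image and label pairs.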
[44] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging
Alphaeus Dmonte, Vidhi Gupta, Daniel J Perry, Mark Arehart
Main category: cs.CL
TL;DR: Multilingual LLM merging reduces initial training time by up to 50% and maintenance training costs by more than 60% while maintaining quality parity, applicable to both academic and industrial use cases.
Details
Motivation: Traditional multilingual LLM fine-tuning requires retraining the entire model when updating languages or adding new ones, creating computational inefficiency and maintenance bottlenecks. Current merging approaches show promise but lack efficiency analysis.
Method: Analyzes merging strategy for multilingual multitask models from an efficiency perspective across three independent tasks. Compares merging approach against traditional full retraining, evaluating both initial training and maintenance scenarios.
Result: Merging reduces initial training time by up to 50% and maintenance training costs by more than 60% compared to full multilingual model retraining. Quality remains at parity. Results validated on both public and proprietary industry datasets.
Conclusion: Multilingual model merging offers significant efficiency gains for both initial training and maintenance while maintaining quality, making it practical for industrial applications beyond academic settings.
Abstract: Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.
[45] Automatic Classification of Arabic Literature into Historical Eras
Zainab Alhathloul, Irfan Ahmad
Main category: cs.CL
TL;DR: Neural networks used to classify Arabic texts by historical era, achieving up to 0.83 F1-score for binary classification but lower scores for more granular periodization.
Details
Motivation: Arabic language has evolved significantly between classical and modern eras, but there's limited research on automatic classification of Arabic texts by time period, especially beyond poetry.
Method: Employed neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods, using two datasets from publicly available corpora covering pre-Islamic to modern era.
Result: F1-scores ranged from 0.83 and 0.79 on binary-era classification using OpenITI and APCD datasets respectively, down to 0.20 on 15-era classification with OpenITI and 0.18 on 12-era classification with APCD.
Conclusion: Neural networks can effectively classify Arabic texts by historical era, with better performance on broader classifications (binary) than more granular periodizations, demonstrating feasibility of automatic Arabic text periodization.
Abstract: The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of others, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated using two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
[46] LLM-in-Sandbox Elicits General Agentic Intelligence
Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
Main category: cs.CL
TL;DR: LLM-in-Sandbox enables LLMs to explore code sandboxes for non-code tasks, showing generalization across domains without training, and can be enhanced via reinforcement learning.
Details
Motivation: To enable LLMs to leverage code sandbox environments (virtual computers) to demonstrate and enhance general intelligence capabilities in non-code domains, allowing them to access external resources, handle long contexts, and execute scripts.
Method: Two approaches: 1) Training-free LLM-in-Sandbox where strong LLMs spontaneously use sandbox capabilities; 2) LLM-in-Sandbox-RL using reinforcement learning with non-agentic data to train models for sandbox exploration.
Result: LLM-in-Sandbox achieves robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following in both training-free and post-trained settings.
Conclusion: The approach enables LLMs to effectively use code sandboxes for non-code tasks, demonstrates strong generalization capabilities, and has been open-sourced as a Python package for real-world deployment with analyzed computational efficiency.
Abstract: We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox’s efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
[47] Paramanu: Compact and Competitive Monolingual Language Models for Low-Resource Morphologically Rich Indian Languages
Mitodru Niyogi, Eric Gaussier, Arnab Bhattacharya
Main category: cs.CL
TL;DR: PARAMANU is a family of small, affordable Indian-only language models trained from scratch on five major Indian languages, achieving strong performance despite small size and low budget.
Details
Motivation: Multilingual LLMs are expensive and suffer from English-centric bias, language imbalances, tokenizer oversegmentation for morphologically rich languages, and the curse of multilinguality. There's a need for affordable, specialized models for Indian languages.
Method: Trained from scratch on open-source data for five Indian languages (Bengali, Hindi, Marathi, Tamil, Telugu) using single GPU under $1,000 budget. Developed morphology-aligned low-fertility tokenizers, interpolation-based method for token position indices in RoPE scaling, and created instruction-tuning datasets translated across languages.
Result: Despite small size (108M-367M parameters), PARAMANU achieves strong performance-efficiency tradeoff and outperforms most larger multilingual models across all five Indian languages.
Conclusion: PARAMANU demonstrates that affordable, specialized language models for Indian languages can be built with limited resources while achieving competitive performance, enabling under-resourced researchers to develop language-specific models.
Abstract: Multilingual large language models (LLMs) are expensive to pretrain and often suffer from imbalances across languages and datasets, English-centric bias, tokenizer oversegmentation for morphologically rich low-resource languages, and the curse of multilinguality. We introduce PARAMANU, the first family of Indian-only autoregressive language models trained from scratch on open-source language-specific data for the five most spoken Indian languages: Bengali, Hindi, Marathi, Tamil, and Telugu. All models are designed for affordability and are trained on a single GPU with a budget under $1,000, allowing under-resourced researchers to build competitive language models. To address low-resource challenges, we develop morphology-aligned, low-fertility tokenizers, propose an interpolation-based method for token position indices in RoPE based scaling to train longer sequences efficiently. We also create instruction-tuning datasets in Bangla that are translated to the other four languages. Despite their small size (108M-367M parameters), Paramanu achieves a strong performance-efficiency tradeoff and outperforms most larger multilingual models across all five languages. Our collection is available at https://huggingface.co/collections/mitodru/paramanu.
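"Low-fertility" refers to tokenizer fertility, commonly defined as the average number of tokens produced per word; oversegmentation shows up as a high value. The sketch below computes it for two toy tokenizers. Both tokenizers and the word list are hypothetical stand-ins and have nothing to do with the Paramanu tokenizers themselves.

```python
def fertility(tokenize, words):
    """Average number of tokens per word for a given tokenizer function."""
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)


def char_tokenizer(word):
    """Worst case for fertility: one token per character."""
    return list(word)


def whole_word_tokenizer(word):
    """Best case for fertility: every word is a single token."""
    return [word]


# Toy word list, for illustration only.
words = ["ab", "cdef"]
high = fertility(char_tokenizer, words)
low = fertility(whole_word_tokenizer, words)
print(high, low)
```

A real subword tokenizer falls between these extremes; for morphologically rich languages, a lower fertility means shorter sequences and more text per context window.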
[48] Vision-Language Models Align with Human Neural Representations in Concept Processing
Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández
Main category: cs.CL
TL;DR: VLMs align with human brain concept processing, but architecture matters - some learn genuinely human-like concepts while others are context-sensitive, with vision-language encoders outperforming generative models.
Details
Motivation: To systematically evaluate how different VLM architectures align with human brain concept processing, and understand the role of visual vs. textual context in this alignment.
Method: Analyzed multiple VLMs with different integration strategies, compared with language-only models. Measured alignment between model representations and fMRI brain responses to concept words presented with either visual (pictures) or textual (sentences) context.
Result: VLMs outperform language-only models in both conditions. Some VLMs (LXMERT, IDEFICS2) learn genuinely human-like concepts during pretraining, while others are highly context-sensitive at inference. Vision-language encoders are more brain-aligned than newer generative VLMs.
Conclusion: VLMs do align with human neural representations in concept processing, but architectural differences matter significantly. The study provides insights into which VLM designs better capture human-like concept processing.
Abstract: Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role played by visual and textual context is still lacking. Here, we analyse multiple VLMs employing different strategies to integrate visual and textual modalities, along with language-only counterparts. We measure the alignment between concept representations by models and existing (fMRI) brain responses to concept words presented in two experimental conditions, where either visual (pictures) or textual (sentences) context is provided. Our results reveal that VLMs outperform the language-only counterparts in both experimental conditions. However, controlled ablation studies show that only for some VLMs, such as LXMERT and IDEFICS2, brain alignment stems from genuinely learning more human-like concepts during pretraining, while others are highly sensitive to the context provided at inference. Additionally, we find that vision-language encoders are more brain-aligned than more recent, generative VLMs. Altogether, our study shows that VLMs align with human neural representations in concept processing, while highlighting differences among architectures. We open-source code and materials to reproduce our experiments at: https://github.com/dmg-illc/vl-concept-processing.
[49] Medal Matters: Probing LLMs’ Failure Cases Through Olympic Rankings
Juhwan Choi, Seunguk Yu, JungMin Yun, YoungBin Kim
Main category: cs.CL
TL;DR: LLMs can recall Olympic medal counts but struggle with ranking teams, revealing limitations in their internal knowledge organization compared to human reasoning.
Details
Motivation: To understand the internal knowledge structures of large language models (LLMs) and how they organize information, using historical Olympic medal data as a test case to examine differences between LLM knowledge organization and human reasoning.
Method: Evaluated LLMs on two tasks using historical Olympic medal tallies: (1) retrieving medal counts for specific teams, and (2) identifying rankings of each team. Used state-of-the-art LLMs to test their performance on these related but distinct knowledge tasks.
Result: LLMs excel at recalling medal counts but struggle with providing rankings, showing a key difference between their knowledge organization and human reasoning. This reveals limitations in how LLMs integrate and organize internal knowledge.
Conclusion: The findings highlight limitations in LLMs’ internal knowledge integration and suggest directions for improvement. The researchers release code, dataset, and model outputs to facilitate further research on LLM knowledge structures.
Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing tasks, yet their internal knowledge structures remain poorly understood. This study examines these structures through the lens of historical Olympic medal tallies, evaluating LLMs on two tasks: (1) retrieving medal counts for specific teams and (2) identifying rankings of each team. While state-of-the-art LLMs excel in recalling medal counts, they struggle with providing rankings, highlighting a key difference between their knowledge organization and human reasoning. These findings shed light on the limitations of LLMs’ internal knowledge integration and suggest directions for improvement. To facilitate further research, we release our code, dataset, and model outputs.
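To make the gap between the two tasks concrete: a medal count is a single lookup, while a ranking requires comparing one team's tally against every other team's. The sketch below derives ranks from tallies. The gold-then-silver-then-bronze ordering is a common convention assumed here, since the abstract does not state which one the paper uses, and the team names and numbers are invented placeholders.

```python
def rank_teams(tallies):
    """Map each team to its rank (1 = best) from (gold, silver, bronze) tallies.

    Assumed ordering: more golds first, then silvers, then bronzes.
    Teams with identical tallies share a rank.
    """
    ordered = sorted(tallies, key=lambda team: tallies[team], reverse=True)
    ranks = {}
    for idx, team in enumerate(ordered):
        previous = ordered[idx - 1] if idx > 0 else None
        if previous is not None and tallies[team] == tallies[previous]:
            ranks[team] = ranks[previous]  # tie: reuse the previous rank
        else:
            ranks[team] = idx + 1
    return ranks

# Placeholder tallies, not real Olympic data.
tallies = {"A": (3, 1, 0), "B": (3, 2, 0), "C": (1, 5, 5), "D": (1, 5, 5)}
ranks = rank_teams(tallies)
```

Answering "how many golds did A win?" touches one entry, whereas answering "what rank is A?" depends on all four entries.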
[50] NP-Hard Lower Bound Complexity for Semantic Self-Verification
Robin Young
Main category: cs.CL
TL;DR: Semantic Self-Verification (SSV) - verifying if a statement accurately characterizes its own semantic properties - is proven NP-complete via reduction from 3-SAT, showing computational barriers for AI safety approaches relying on semantic interpretation.
Details
Motivation: Addresses a core challenge in AI safety and fairness: can AI systems verify they have correctly interpreted rules governing their behavior? This is crucial for constitutional AI, alignment via natural language, and instruction-following systems.
Method: Models SSV as determining whether a statement accurately characterizes its own semantic properties within an interpretive framework. Proves NP-completeness via polynomial-time reduction from 3-SAT, mapping 3-SAT formulas to SSV instances with ambiguous terms and semantic constraints.
Result: SSV is proven NP-complete, establishing computational barriers for even simplified forms of semantic self-verification. The reduction shows that verifying semantic interpretations faces fundamental computational complexity.
Conclusion: AI safety approaches relying on semantic interpretation of instructions face computational barriers due to NP-completeness of SSV. More realistic verification scenarios likely face even greater complexity, challenging methods like constitutional AI and alignment via natural language.
Abstract: We model Semantic Self-Verification (SSV) as the problem of determining whether a statement accurately characterizes its own semantic properties within a given interpretive framework that formalizes a challenge in AI safety and fairness: can an AI system verify that it has correctly interpreted rules intended to govern its behavior? We prove that SSV, in this specification, is NP-complete by constructing a polynomial-time reduction from 3-Satisfiability (3-SAT). Our reduction maps a 3-SAT formula to an instance of SSV involving ambiguous terms with binary interpretations and semantic constraints derived from logical clauses. This establishes that even simplified forms of semantic self-verification should face computational barriers. The NP-complete lower bound has implications for AI safety and fairness approaches that rely on semantic interpretation of instructions, including but not limited to constitutional AI, alignment via natural language, and instruction-following systems. Approaches where an AI system verifies its understanding of directives may face this computational barrier. We argue that more realistic verification scenarios likely face even greater complexity.
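The 3-SAT side of the reduction can be sketched as follows. Checking one candidate assignment takes time linear in the formula size, while the naive search tries up to 2^n assignments. Reading each Boolean variable as the binary interpretation of an ambiguous term follows the abstract's description, but the encoding below (signed integers for literals, the function names) is an illustrative assumption and not the paper's construction.

```python
from itertools import product

def satisfies(clauses, assignment):
    """Check a candidate assignment against a 3-CNF formula.

    clauses: list of 3-tuples of nonzero ints; literal k refers to
             variable |k| and is negated when k < 0.
    assignment: dict mapping variable index -> bool.
    """
    def literal_true(lit):
        value = assignment[abs(lit)]
        return value if lit > 0 else not value
    return all(any(literal_true(lit) for lit in clause) for clause in clauses)

def brute_force(clauses, n_vars):
    """Try all 2^n_vars assignments; return a satisfying one or None."""
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: bit for i, bit in enumerate(bits)}
        if satisfies(clauses, assignment):
            return assignment
    return None

clauses = [(1, -2, 3), (-1, 2, 3), (-1, -2, -3)]
ok = satisfies(clauses, {1: True, 2: True, 3: False})
```

The asymmetry between the cheap `satisfies` check and the exponential `brute_force` search is what NP-completeness formalizes.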
[51] UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
Yizhe Xiong, Wei Huang, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jungong Han, Guiguang Ding
Main category: cs.CL
TL;DR: UniAttn reduces LLM inference costs by unifying Softmax activations across transformer blocks during post-training, addressing the bottleneck of Softmax operations while maintaining performance.
Details
Motivation: Post-training LLMs for real-world applications faces challenges with memory overhead and inference latency. Existing KV sharing methods still have high inference time overhead, and Softmax operations are identified as a primary bottleneck that is highly redundant during post-training.
Method: Proposes UniAttn (Softmax Unification in Attention), which unifies Softmax activations across transformer blocks to reduce inference costs. Also uses a linear projection to compensate for errors induced by Softmax unification.
Result: UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures (intra-layer and cross-layer KV sharing methods) during post-training.
Conclusion: UniAttn provides an effective post-training method that reduces LLM inference costs by addressing the Softmax bottleneck through activation unification, making it superior to existing efficient architectures for post-training pre-trained LLMs.
Abstract: Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, these methods still result in high inference time overhead, remaining suboptimal for post-training pre-trained LLMs. In this paper, we identify that the Softmax operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax Unification in Attention (UniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training.
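A rough sketch of the idea of sharing Softmax activations between blocks: one block computes the attention weights, and a later block reuses them with its own values plus a linear correction term. How blocks are grouped, where the compensation is applied, and all names below (`unified_block`, `w_comp`) are assumptions made for illustration, since the abstract does not give these details.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_weights(q, k):
    """softmax(q @ k^T / sqrt(d)), computed once in the leading block."""
    d = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def unified_block(shared_weights, v, x, w_comp):
    """Later block: reuse shared_weights instead of recomputing Softmax.

    Hypothetical compensation: add the linear projection x @ w_comp.
    """
    attended = matmul(shared_weights, v)
    correction = matmul(x, w_comp)
    return [[a + c for a, c in zip(ra, rc)] for ra, rc in zip(attended, correction)]

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(q, k)           # computed once
v = [[2.0, 0.0], [0.0, 4.0]]
x = [[1.0, 1.0], [1.0, -1.0]]
zero_comp = [[0.0, 0.0], [0.0, 0.0]]
out = unified_block(weights, v, x, zero_comp)
```

In this toy setup the later block skips the score computation and the Softmax entirely, which is the source of the saving the abstract describes.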
[52] The exponential distribution of the order of demonstrative, numeral, adjective and noun
Ramon Ferrer-i-Cancho
Main category: cs.CL
TL;DR: The paper finds that word order distributions follow an exponential rather than power law distribution, challenging the inevitability of power laws like Zipf’s law, and suggests unattested orders result from undersampling rather than hard constraints.
Details
Motivation: To resolve the debate about whether word order distributions follow exponential or power law patterns, and to understand the nature of constraints on word order variation in noun phrases (demonstrative, numeral, adjective, noun).
Method: Investigated the actual distribution of 24 possible word orders, comparing exponential vs power law models. Tested two exponential distributions: geometric distribution truncated at rank 24 vs right-truncated geometric distribution with variable number of possible orders.
Result: Exponential distribution provides much better fit than power law. Among exponential models, the geometric distribution where all 24 orders have non-zero probability shows higher support when prioritizing consistency and generalizability.
Conclusion: Word order variation follows exponential distribution, not power law, challenging the view that power laws are inevitable in language. Unattested orders likely result from undersampling rather than hard constraints, supporting Cysouw’s view.
Abstract: The frequency of the preferred order for a noun phrase formed by demonstrative, numeral, adjective and noun has received significant attention over the last two decades. We investigate the actual distribution of the 24 possible orders. There is no consensus on whether it is well-fitted by an exponential or a power law distribution. We find that an exponential distribution is a much better model. This finding and other circumstances where an exponential-like distribution is found challenge the view that power-law distributions, e.g., Zipf’s law for word frequencies, are inevitable. We also investigate which of two exponential distributions gives a better fit: an exponential model where the 24 orders have non-zero probability (a geometric distribution truncated at rank 24) or an exponential model where the number of orders that can have non-zero probability is variable (a right-truncated geometric distribution). When consistency and generalizability are prioritized, we find higher support for the exponential model where all 24 orders have non-zero probability. These findings strongly suggest that there is no hard constraint on word order variation and that unattested orders merely result from undersampling, consistent with Cysouw’s view.
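The two families of models compared above can be written down directly. The sketch below uses the standard right-truncated geometric form p(r) = q(1-q)^(r-1) / (1 - (1-q)^n) over ranks 1..n and a power law p(r) proportional to r^(-a) normalized over the same ranks. The function names and the log-likelihood helper are illustrative assumptions, and the paper's actual fitting and model-selection procedure is not reproduced here.

```python
import math

def truncated_geometric(q, n=24):
    """Geometric probabilities over ranks 1..n, renormalized to sum to 1.

    n=24 gives the model where all 24 orders have non-zero probability;
    a smaller n gives a model where only the top n orders are possible.
    """
    norm = 1.0 - (1.0 - q) ** n
    return [q * (1.0 - q) ** (r - 1) / norm for r in range(1, n + 1)]

def power_law(a, n=24):
    """Power-law probabilities p(r) proportional to r^(-a) over ranks 1..n."""
    weights = [r ** (-a) for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(counts, probs):
    """Log-likelihood of rank-ordered counts under a rank distribution."""
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

geometric = truncated_geometric(0.3)
zipf_like = power_law(1.0)
```

On a log-linear plot the geometric model is a straight line in rank, while the power law is a straight line only on a log-log plot, which is one informal way to tell the two apart before fitting.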
[53] GENERator: A Long-Context Generative Genomic Foundation Model
Wei Wu, Qiuyi Li, Yuanyuan Zhang, Zhihao Zhan, Ruipu Chen, Mingyang Li, Kun Fu, Junyan Qi, Yongzhou Bao, Chao Wang, Yiheng Zhu, Zhiyun Zhang, Jian Tang, Fuli Feng, Jieping Ye, Yuwen Liu, Hui Xiong, Zheng Wang
Main category: cs.CL
TL;DR: GENERator is a generative genomic foundation model with 98k nucleotide context length, pre-trained on 386B nucleotides, showing strong zero-shot capabilities for variant prediction and generative applications for protein-coding sequences and cis-regulatory elements.
Details
Motivation: Existing genomic language models are limited by restricted training scope, constrained generative capability, or prohibitive computational costs, despite vast genomic datasets being available.
Method: Developed GENERator, a generative genomic foundation model with 98k nucleotide context length, pre-trained on 386 billion nucleotides of eukaryotic DNA, enabling both zero-shot analysis and task-specific fine-tuning.
Result: Achieves competitive variant effect prediction without fine-tuning, shows phylogenetically coherent embeddings, generates structurally plausible protein-coding sequences, and designs functional cis-regulatory elements validated by UMI-STARR-seq assays.
Conclusion: GENERator establishes an efficient, biologically grounded framework for genomic interpretation and programmable sequence design, with broad applicability across species.
Abstract: The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but existing approaches are often limited by restricted training scope, constrained generative capability, or prohibitive computational cost. We introduce GENERator, a generative genomic foundation model for long-context DNA modeling, with a context length of 98k nucleotides, pre-trained on 386 billion nucleotides of eukaryotic DNA. Without task-specific fine-tuning, GENERator exhibits strong intrinsic capabilities: unsupervised embedding analyses reveal phylogenetically coherent structure, and sequence recovery benchmarks demonstrate generative accuracy comparable to or exceeding state-of-the-art models with substantially improved computational efficiency. In a zero-shot setting, GENERator achieves competitive variant effect prediction performance relative to alignment-based methods, while remaining fully alignment-free and broadly applicable across species. With task-specific fine-tuning, the model attains leading performance on established genomic benchmarks. We further demonstrate practical generative applications. GENERator can generate protein-coding DNA sequences that translate into structurally plausible proteins and, through a prompt-guided design framework, design cis-regulatory elements with targeted activity profiles, including synthetic super-enhancers validated by high-throughput UMI-STARR-seq assays. Together, these results establish GENERator as an efficient and biologically grounded framework for genomic interpretation and programmable sequence design. Code and supplementary resources are available at https://github.com/GenerTeam/GENERator.
[54] SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning
Renxi Wang, Honglin Mu, Liqun Ma, Lizhi Lin, Yunlong Feng, Timothy Baldwin, Xudong Han, Haonan Li
Main category: cs.CL
TL;DR: SCALAR is a benchmark for evaluating citation-grounded long-context reasoning in academic papers, using automatic label generation and dynamic updates to prevent data contamination.
Details
Motivation: There's a need for better evaluation of long-context understanding in LLMs, particularly for academic writing where citation reasoning is crucial but current evaluation methods are inadequate.Method: SCALAR uses academic papers and their citation structures to automatically generate ground-truth labels without human annotation. It features controllable difficulty levels, dynamic updating, and includes two tasks: multiple-choice QA and cloze-style citation prediction.
Result: Human experts achieve over 90% accuracy on multiple-choice tasks, but most LLMs struggle. The cloze-style task is even harder, with no model exceeding 50% accuracy. The multiple-choice format effectively distinguishes model capabilities.
Conclusion: SCALAR provides a domain-grounded, continuously updating framework for tracking progress in citation-based long-context understanding, revealing significant gaps between human and current model performance.
Abstract: Long-context understanding has emerged as a critical capability for large language models (LLMs). However, evaluating this ability remains challenging. We present SCALAR, a benchmark designed to assess citation-grounded long-context reasoning in academic writing. SCALAR leverages academic papers and their citation structure to automatically generate high-quality ground-truth labels without human annotation. It features controllable difficulty levels and a dynamic updating mechanism that mitigates data contamination. The benchmark includes two tasks: a multiple-choice QA format and a cloze-style citation prediction. We evaluate a range of state-of-the-art LLMs and find that the multiple-choice task effectively distinguishes model capabilities. While human experts achieve over 90% accuracy, most models struggle. The cloze-style task is even more challenging, with no model exceeding 50% accuracy. SCALAR provides a domain-grounded, continuously updating framework for tracking progress in citation-based long-context understanding.
[55] I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search
Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu
Main category: cs.CL
TL;DR: I-MCTS improves LLM-based AutoML agents by using introspective node expansion and LLM-based value models for better code generation diversity and quality.
Details
Motivation: Existing LLM-based AutoML agents suffer from low-diversity and suboptimal code generation, and current MCTS approaches have limitations in thought quality/diversity and scalar value feedback mechanisms.
Method: Introduces Introspective Monte Carlo Tree Search (I-MCTS) with iterative node expansion through introspective analysis of parent/sibling solutions, LLM-based value models for node evaluation, and hybrid rewarding mechanism transitioning from LLM-estimated to actual performance scores.
Result: Achieves 4% absolute performance improvement over strong open-source AutoML agents across various ML tasks.
Conclusion: I-MCTS effectively enhances agentic AutoML systems by improving decision-making through introspective refinement and better node evaluation mechanisms.
Abstract: Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low-diversity and suboptimal code generation. While recent work has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node’s solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from LLM-estimated scores to actual performance scores. This allows higher-quality nodes to be traversed earlier. Applied to various ML tasks, our approach demonstrates a 4% absolute improvement in performance compared to strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems. Resource available at https://github.com/jokieleung/I-MCTS
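One way to picture a Q-value that moves from an LLM-estimated score to measured performance is a weighted blend whose weight shifts toward real rollout results as they accumulate. The blending rule below, the `prior_weight` parameter, and the function name are assumptions made for illustration. The abstract does not specify the actual mechanism.

```python
def hybrid_q(llm_estimate, rollout_scores, prior_weight=1.0):
    """Blend an LLM-estimated score with the mean of actual rollout scores.

    With no rollouts the Q-value equals the LLM estimate; as rollouts
    accumulate, the weight on measured performance approaches 1.
    prior_weight (hypothetical) sets how many rollouts the LLM estimate
    is 'worth'.
    """
    n = len(rollout_scores)
    if n == 0:
        return llm_estimate
    actual = sum(rollout_scores) / n
    w = n / (n + prior_weight)
    return (1.0 - w) * llm_estimate + w * actual

before = hybrid_q(0.8, [])            # only the LLM estimate is available
after_one = hybrid_q(0.8, [0.4])      # one real score pulls Q halfway
after_many = hybrid_q(0.8, [0.4] * 99)
```

Under this rule a node with a promising LLM estimate can be selected before any expensive rollout has run, and a misleading estimate is gradually overridden by observed scores.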
[56] English K_Quantization of LLMs Does Not Disproportionately Diminish Multilingual Performance
Karl Audun Borgersen, Morten Goodwin
Main category: cs.CL
TL;DR: Quantizing LLMs with different language importance matrices doesn’t significantly affect multilingual performance - English, Norwegian, and Malayalam matrices yield similar results on English and Norwegian tasks.
Details
Motivation: Current quantization practices use English-centric importance matrices, raising concerns about whether they preserve English performance at the expense of multilingual capabilities, and whether alternative language matrices could better preserve multilingual performance.
Method: Quantized Llama3.3 70B using GGUF format with k-quantization on three different importance matrices (English, Norwegian, Malayalam), then evaluated on MixEval dataset in both English and Norwegian languages.
Result: All experiments yielded non-significant results - no statistically significant differences in performance between different language importance matrices, indicating current quantization practices don’t disproportionately harm multilingual performance.
Conclusion: Language choice in importance matrices doesn’t significantly impact multilingual performance in quantization, suggesting current English-centric practices are adequate and alternative language matrices don’t provide meaningful advantages.
Abstract: For consumer usage of locally deployed LLMs, the GGUF format and k_quantization are invaluable tools for maintaining the performance of the original model while reducing it to sizes deployable with consumer-grade hardware. The number of bits dedicated to each weight from the original model is reduced based on how important they are thought to be during model inference. This importance is arrived at through the application of an ‘importance matrix’, a relatively small text document meant to be representative of the LLM’s standard use-cases. In the vast majority of quants available online, this document is primarily written in English. It was therefore an open question whether performance on English language tasks was preserved through the sacrifice of multilingual performance and whether it can be preserved with alternate importance matrices. This article investigates these hypotheses by quantizing Llama3.3 70B on importance matrices written in three languages (English, Norwegian, and Malayalam) and evaluating them on the MixEval dataset in both English and Norwegian. All experiments yielded non-significant results, indicating that current quantization practices do not disproportionately harm multilingual performance.
[57] Poor Alignment and Steerability of Large Language Models: Evidence from College Admission Essays
Jinsook Lee, AJ Alvero, Thorsten Joachims, René Kizilcec
Main category: cs.CL
TL;DR: LLM-generated college admission essays are linguistically distinct from human essays, and demographic prompting fails to align them with human writing patterns, raising concerns about LLM use in high-stakes contexts.
Details
Motivation: The study investigates two key questions about LLMs in formal communication: who LLMs write like (model alignment) and whether LLMs can be prompted to change who they write like (model steerability), particularly in the high-stakes context of college admissions where authenticity matters.
Method: Researchers compared lexical and sentence variation between 30,000 human-authored undergraduate admission essays and two types of LLM-generated essays: one prompted only with the essay question, and another with additional demographic information about each applicant. They analyzed linguistic patterns across multiple models and approaches.
Result: Both types of LLM-generated essays were linguistically distinct from human-authored essays. Demographic prompting was remarkably ineffective in aligning models with linguistic patterns of specific identity groups (sex, race, first-generation status, geographic location). Demographically prompted and unprompted synthetic texts were more similar to each other than to human text.
Conclusion: Current LLMs show significant issues with model alignment and steerability, failing to authentically replicate human writing patterns even with demographic prompting. This raises serious concerns about using LLMs in high-stakes contexts like college admissions where authentic self-expression is crucial.
Abstract: People are increasingly using technologies equipped with large language models (LLMs) to write texts for formal communication, which raises two important questions at the intersection of technology and society: Who do LLMs write like (model alignment); and can LLMs be prompted to change who they write like (model steerability). We investigate these questions in the high-stakes context of undergraduate admissions at a selective university by comparing lexical and sentence variation between essays written by 30,000 applicants to two types of LLM-generated essays: one prompted with only the essay question used by the human applicants; and another with additional demographic information about each applicant. We consistently find that both types of LLM-generated essays are linguistically distinct from human-authored essays, regardless of the specific model and analytical approach. Further, prompting a specific sociodemographic identity is remarkably ineffective in aligning the model with the linguistic patterns observed in human writing from this identity group. This holds along the key dimensions of sex, race, first-generation status, and geographic location. The demographically prompted and unprompted synthetic texts were also more similar to each other than to the human text, meaning that prompting did not alleviate homogenization. These issues of model alignment and steerability in current LLMs raise concerns about the use of LLMs in high-stakes contexts.
[58] RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
Main category: cs.CL
TL;DR: RADLADS is a protocol for converting softmax attention transformers into linear attention decoders with minimal training (350-700M tokens), achieving near-original quality at low cost (<$2K for 72B model).
Details
Motivation: To enable efficient conversion of existing transformer models to linear attention architectures with minimal computational cost while maintaining performance, making large linear attention models more accessible.
Method: Developed a distillation protocol (RADLADS) that converts softmax attention transformers to linear attention decoders using only 350-700M tokens (<0.005% of original training). Created new RWKV-variant architectures and converted popular Qwen2.5 models (7B, 32B, 72B sizes).
Result: Models achieve state-of-the-art downstream performance for linear attention models of their size, with inference quality close to original transformers. The 72B conversion costs <$2,000 USD while maintaining near-original quality.
Conclusion: RADLADS enables cost-effective conversion of existing transformers to linear attention models with minimal training, making large-scale linear attention models practical and accessible. All models are released on HuggingFace under the Apache 2.0 license; the 72B models are additionally governed by the Qwen License Agreement.
Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today’s prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
[59] Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning
Adam Štorek, Mukur Gupta, Samira Hajizadeh, Prashast Srivastava, Suman Jana
Main category: cs.CL
TL;DR: LLMs excel at lexical recall (verbatim code retrieval) but fail at semantic recall (understanding operational semantics) in long contexts, especially when code is centrally positioned. Current benchmarks underestimate these failures due to pattern matching shortcuts.
Details
Motivation: To determine whether LLMs truly understand the operational semantics of long code contexts or merely rely on pattern matching shortcuts, distinguishing between lexical recall (verbatim retrieval) and semantic recall (understanding semantics).
Method: Evaluated 10 state-of-the-art LLMs, introduced semantic recall sensitivity measurement, used counterfactual measurement to detect pattern matching shortcuts, and created a new task called SemTrace with unpredictable operations to test semantic understanding.
Result: Frontier models achieve near-perfect lexical recall (position-independent), but semantic recall degrades severely when code is centrally positioned in long contexts. LLMs rely heavily on pattern matching shortcuts in existing benchmarks. SemTrace shows median accuracy drops of 92.73% vs CRUXEval’s 53.36% when relevant code approaches the middle.
Conclusion: Current evaluations substantially underestimate semantic recall failures in long context code understanding. LLMs’ apparent competence in code understanding tasks may be largely due to pattern matching rather than genuine semantic understanding of operational semantics.
Abstract: Large language models (LLMs) are increasingly deployed for understanding large codebases, but whether they understand operational semantics of long code context or rely on pattern matching shortcuts remains unclear. We distinguish between lexical recall (retrieving code verbatim) and semantic recall (understanding operational semantics). Evaluating 10 state-of-the-art LLMs, we find that while frontier models achieve near-perfect, position-independent lexical recall, semantic recall degrades severely when code is centrally positioned in long contexts. We introduce semantic recall sensitivity to measure whether tasks require understanding of code’s operational semantics vs. permit pattern matching shortcuts. Through a novel counterfactual measurement method, we show that models rely heavily on pattern matching shortcuts to solve existing code understanding benchmarks. We propose a new task SemTrace, which achieves high semantic recall sensitivity through unpredictable operations; LLMs’ accuracy exhibits severe positional effects, with median accuracy drops of 92.73% versus CRUXEval’s 53.36% as the relevant code snippet approaches the middle of the input code context. Our findings suggest current evaluations substantially underestimate semantic recall failures in long context code understanding.
[60] NLP for Social Good: A Survey and Outlook of Challenges, Opportunities, and Responsible Deployment
Antonia Karamolegkou, Angana Borah, Eunjung Cho, Sagnik Ray Choudhury, Martina Galletti, Pranav Gupta, Oana Ignat, Priyanka Kargupta, Neema Kotonya, Hemank Lamba, Sun-Joo Lee, Arushi Mangla, Ishani Mondal, Fatima Zahra Moudakir, Deniz Nazarova, Poli Nemkova, Dina Pisarevskaya, Naquee Rizwan, Nazanin Sabri, Keenan Samway, Dominik Stammbach, Anna Steinberg, David Tomás, Steven R Wilson, Bowen Yi, Jessica H Zhu, Arkaitz Zubiaga, Anders Søgaard, Alexander Fraser, Zhijing Jin, Rada Mihalcea, Joel R. Tetreault, Daryna Dementieva
Main category: cs.CL
TL;DR: Survey of NLP for Social Good (NLP4SG) across nine global development domains, analyzing research trends and identifying gaps in poverty, peacebuilding, and environmental protection.
Details
Motivation: NLP has significant potential for positive social impact but remains underexplored in many critical areas of global development and social good.
Method: Comprehensive survey of NLP4SG work across nine domains, analysis of ACL Anthology publication trends, and identification of principal tasks and challenges in each domain.
Result: Found that inclusion and AI harms receive the most research attention, while poverty, peacebuilding, and environmental protection remain significantly underexplored domains.
Conclusion: Calls for responsible and equitable NLP development, cross-disciplinary partnerships, and human-centered approaches to ensure NLP technologies advance the public good.
Abstract: Natural language processing (NLP) now shapes many aspects of our world, yet its potential for positive social impact is underexplored. This paper surveys work in “NLP for Social Good” (NLP4SG) across nine domains relevant to global development and risk agendas, summarizing principal tasks and challenges. We analyze ACL Anthology trends, finding that inclusion and AI harms attract the most research, while domains such as poverty, peacebuilding, and environmental protection remain underexplored. Guided by our review, we outline opportunities for responsible and equitable NLP and conclude with a call for cross-disciplinary partnerships and human-centered approaches to ensure that future NLP technologies advance the public good.
[61] MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators
John Mendonça, Alon Lavie, Isabel Trancoso
Main category: cs.CL
TL;DR: MEDAL is an automated multi-agent framework for creating diverse, multilingual dialogue evaluation benchmarks to better assess chatbot quality and LLM judges’ limitations.
Details
Motivation: Existing meta-evaluation benchmarks for open-domain chatbots are static, outdated, and lack multilingual coverage, limiting their ability to capture subtle weaknesses in evaluation.
Method: Uses multiple LLMs to generate multilingual user-chatbot dialogues from varied seed contexts, then employs GPT-4.1 for multidimensional analysis to identify cross-lingual performance differences and curate a new human-annotated benchmark.
Result: Uncovered noticeable cross-lingual performance differences in chatbots and found that state-of-the-art LLM judges fail to reliably detect nuanced issues like lack of empathy, commonsense, or relevance.
Conclusion: MEDAL provides a more representative and diverse evaluation framework that reveals significant limitations in current LLM-based chatbot evaluation methods, particularly for detecting subtle quality issues across languages.
Abstract: Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a strong LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.
[62] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu
Main category: cs.CL
TL;DR: R-KV is a redundancy-aware KV cache compression method for reasoning models that preserves near-full performance with only 10% cache, achieving 90% memory savings and 6.6X throughput.
Details
Motivation: Reasoning models produce excessively long outputs leading to prohibitively large KV caches during inference, and existing KV cache compression approaches often cause reasoning failures in chain-of-thought reasoning tasks.
Method: Proposes Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models to compress the KV cache while preserving reasoning performance.
Result: R-KV preserves nearly 100% of full KV cache performance using only 10% of the KV cache, outperforming baselines that reach only 60% of that performance. It also reaches 105% of full KV cache performance with 16% of the cache. The cache reduction yields a 90% memory saving and 6.6X throughput over standard chain-of-thought inference.
Conclusion: R-KV consistently outperforms existing KV cache compression baselines across mathematical reasoning datasets, offering efficient KV cache compression specifically optimized for reasoning models without sacrificing performance.
Abstract: Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
[63] Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics
Asifullah Khan, Muhammad Zaeem Khan, Saleha Jamshed, Sadia Ahmad, Aleesha Zainab, Kaynat Khatib, Faria Bibi, Abdul Rehman
Main category: cs.CL
TL;DR: This survey paper comprehensively reviews advancements in Large Language Models (LLMs), covering reasoning improvements, computational efficiency, ethical decision-making, and emerging applications like Agentic AI, while identifying current challenges and future research directions.
Details
Motivation: To provide a holistic overview of key developments in LLMs beyond isolated aspects, bridging the gap between human and machine communication, and addressing the need for more intelligent, efficient, and ethically-aligned language models.
Method: The paper surveys and categorizes emerging methods including Chain-of-Thought prompting, Instruction Tuning, Reinforcement Learning from Human Feedback, multimodal learning, few-shot/zero-shot techniques, scaling strategies, optimization techniques, and Mixture-of-Experts architecture.
Result: The survey identifies effective techniques for enhancing LLM reasoning, adaptability, and efficiency, while also highlighting underexplored areas like interpretability, cross-modal integration, and sustainability, and persistent challenges including computational costs, biases, and ethical risks.
Conclusion: While significant progress has been made in LLM development, future research should focus on enhancing multimodal capabilities, improving safety and reliability, addressing bias mitigation, transparent decision-making, and establishing explicit ethical guidelines to overcome remaining challenges.
Abstract: This survey paper outlines the key developments in the field of Large Language Models (LLMs), including enhancements to their reasoning skills, adaptability to various tasks, increased computational efficiency, and the ability to make ethical decisions. The techniques that have been most effective in bridging the gap between human and machine communications include the Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback. The improvements in multimodal learning and few-shot or zero-shot techniques have further empowered LLMs to handle complex jobs with minor input. A significant focus is placed on efficiency, detailing scaling strategies, optimization techniques, and the influential Mixture-of-Experts (MoE) architecture, which strategically routes inputs to specialized subnetworks to boost predictive accuracy, while optimizing resource allocation. This survey also offers a broader perspective on recent advancements in LLMs, going beyond isolated aspects such as model architecture or ethical concerns. Additionally, it explores the role of LLMs in Agentic AI and their use as Autonomous Decision-Making Systems, and categorizes emerging methods that enhance LLM reasoning, efficiency, and ethical alignment. The survey also identifies underexplored areas such as interpretability, cross-modal integration, and sustainability. While significant advancements have been made in LLMs, challenges such as high computational costs, biases, and ethical risks remain. Overcoming these requires a focus on bias mitigation, transparent decision-making, and explicit ethical guidelines. Future research will generally focus on enhancing the model’s ability to handle multiple inputs, thereby making it more intelligent, safe, and reliable.
[64] SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks
Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Charles McGrady, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
Main category: cs.CL
TL;DR: SciArena is a community-driven platform for evaluating foundation models on scientific literature tasks using human voting, similar to Chatbot Arena, with 47 models and 20k+ votes, plus a meta-evaluation benchmark called SciArena-Eval.
Details
Motivation: Traditional benchmarks for scientific literature understanding are limited; there's a need for community-driven evaluation of foundation models on open-ended, literature-grounded tasks that require long-form responses.
Method: Created an open collaborative platform where researchers vote on model comparisons for scientific tasks, collected 20,000+ votes across 47 models, and developed SciArena-Eval benchmark for automated evaluation by comparing model judgments with human votes.
Result: Platform successfully collected high-quality data, established model rankings, and revealed challenges in automated evaluation through the SciArena-Eval benchmark, showing current methods are not reliable enough.
Conclusion: SciArena provides valuable community-driven evaluation for scientific foundation models, while SciArena-Eval highlights the need for better automated evaluation methods for literature-grounded tasks.
Abstract: We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 47 foundation models and has collected over 20,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on collected preference data. It measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark’s challenges and emphasize the need for more reliable automated evaluation methods.
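Illustration: the abstract says SciArena-Eval scores a model judge by comparing its pairwise assessments with human votes. A minimal sketch of that kind of agreement accuracy follows. The function name, the "A"/"B" label encoding, and the handling of empty input are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence


def pairwise_agreement(judge_prefs: Sequence[str], human_votes: Sequence[str]) -> float:
    """Return the fraction of comparisons where the judge agrees with the human vote.

    Each entry is a label such as "A" or "B" naming the preferred response
    in one pairwise comparison (hypothetical encoding).
    """
    if len(judge_prefs) != len(human_votes):
        raise ValueError("judge_prefs and human_votes must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_prefs, human_votes))
    return matches / len(human_votes)


if __name__ == "__main__":
    # The judge agrees with the human vote on 3 of 4 comparisons.
    print(pairwise_agreement(["A", "B", "A", "B"], ["A", "B", "B", "B"]))
```

A score near 0.5 on balanced pairs would indicate a judge that is no better than chance.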
[65] DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
Main category: cs.CL
TL;DR: DocPolarBERT is a layout-aware BERT model that uses relative polar coordinates instead of absolute 2D positional embeddings for document understanding, achieving SOTA results with much less pre-training data.
Details
Motivation: The paper aims to develop a more efficient document understanding model that eliminates the need for absolute 2D positional embeddings, which can be computationally expensive and may not capture document layout relationships effectively.
Method: Extends self-attention to consider text block positions in a relative polar coordinate system rather than Cartesian coordinates. This layout-aware approach allows the model to understand document structure without traditional absolute positional embeddings.
Result: DocPolarBERT achieves state-of-the-art results despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus. The model demonstrates superior performance in document understanding tasks.
Conclusion: A carefully designed attention mechanism with relative polar coordinates can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding that reduces computational requirements while maintaining high performance.
Abstract: We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
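Illustration: a relative polar encoding describes where one text block sits with respect to another by a distance and an angle instead of absolute (x, y) positions. The sketch below shows only that geometric step. Using bounding-box centers as reference points, and the names `bbox_center` and `relative_polar`, are assumptions for this example. The abstract does not say how the paper discretizes these values or injects them into attention.

```python
import math
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def bbox_center(box: BBox) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def relative_polar(src: BBox, dst: BBox) -> Tuple[float, float]:
    """Return (distance, angle in radians) of dst relative to src.

    The angle comes from atan2, so it lies in (-pi, pi] and is measured
    from the positive x-axis of the page coordinate system.
    """
    sx, sy = bbox_center(src)
    dx, dy = bbox_center(dst)
    return math.hypot(dx - sx, dy - sy), math.atan2(dy - sy, dx - sx)


if __name__ == "__main__":
    a = (0.0, 0.0, 2.0, 2.0)  # center (1, 1)
    b = (3.0, 4.0, 5.0, 6.0)  # center (4, 5)
    print(relative_polar(a, b))
```

Because only differences between block positions enter the computation, the result is unchanged if the whole page is translated, which is the property that makes absolute positional embeddings unnecessary.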
[66] ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline
Morris Alper, Moran Yanuka, Raja Giryes, Gašper Beguš
Main category: cs.CL
TL;DR: ConlangCrafter uses LLMs as computational creativity tools for end-to-end constructed language creation through a multi-stage pipeline with self-refinement feedback.
Details
Motivation: Constructed languages have important roles in art, philosophy, and international communication, but creating them requires linguistic expertise. Foundation models have shown strong creative generation capabilities, so the authors aim to leverage LLMs as computational creativity aids to democratize conlang creation.
Method: ConlangCrafter is a multi-hop pipeline that decomposes language design into modular stages: phonology, morphology, syntax, lexicon generation, and translation. It uses LLMs’ metalinguistic reasoning capabilities, injects randomness for diversity, and employs self-refinement feedback to ensure consistency in the emerging language description.
Result: The authors developed a novel, scalable evaluation framework measuring consistency and typological diversity. Both automatic and manual evaluations demonstrate that ConlangCrafter can produce coherent and varied constructed languages without requiring human linguistic expertise.
Conclusion: LLMs can effectively serve as computational creativity tools for constructed language creation, enabling end-to-end conlang generation through structured pipelines that balance diversity and consistency, potentially democratizing language creation.
Abstract: Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages – phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs’ metalinguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We construct a novel, scalable evaluation framework for this task, evaluating metrics measuring consistency and typological diversity. Automatic and manual evaluations demonstrate ConlangCrafter’s ability to produce coherent and varied conlangs without human linguistic expertise.
[67] Being Kind Isn’t Always Being Safe: Diagnosing Affective Hallucination in LLMs
Sewon Kim, Jiwon Kim, Seungwoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, Hyunsoo Yoon
Main category: cs.CL
TL;DR: The paper introduces “Affective Hallucination” - LLMs creating false emotional connections in mental health conversations, and provides AHaBench benchmark and AHaPairs dataset to address this safety concern.
Details
Motivation: LLMs are increasingly used in emotionally vulnerable conversations where they simulate empathy and affective tones, creating illusions of genuine relational connection despite lacking actual affective capacity. This poses psychological safety risks.
Method: Created AHaBench (500 mental-health prompts with expert reference responses) evaluated on Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. Also developed AHaPairs (5K preference dataset) for DPO fine-tuning to align models with emotionally responsible behavior.
Result: DPO fine-tuning substantially reduces affective hallucination without compromising reasoning performance. A strong correlation (r=0.85) between GPT-4o and human judgments validates AHaBench as an effective diagnostic tool.
Conclusion: Establishes affective hallucination as a distinct safety concern and provides resources (benchmark, dataset, code) for developing LLMs that are both factually reliable and psychologically safe in mental health contexts.
Abstract: Large Language Models (LLMs) are increasingly engaged in emotionally vulnerable conversations that extend beyond information seeking to moments of personal distress. As they adopt affective tones and simulate empathy, they risk creating the illusion of genuine relational connection. We term this phenomenon Affective Hallucination, referring to emotionally immersive responses that evoke false social presence despite the model’s lack of affective capacity. To address this, we introduce AHaBench, a benchmark of 500 mental-health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. DPO fine-tuning substantially reduces affective hallucination without compromising reasoning performance, and the Pearson correlation coefficient between GPT-4o and human judgments is strong (r=0.85), indicating that human evaluations confirm AHaBench as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides resources for developing LLMs that are both factually reliable and psychologically safe. AHaBench and AHaPairs are accessible via https://huggingface.co/datasets/o0oMiNGo0o/AHaBench, and code for fine-tuning and evaluation is available at https://github.com/0oOMiNGOo0/AHaBench. Warning: This paper contains examples of mental health-related language that may be emotionally distressing.
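Illustration: the abstract reports agreement between GPT-4o and human judgments as a Pearson correlation coefficient. For readers unfamiliar with the statistic, a minimal sketch of how it is computed from two lists of scores follows. The function name and the example scores are made up for this illustration and are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient between two score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance term and the two standard-deviation terms (unnormalized,
    # since the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


if __name__ == "__main__":
    # Hypothetical judge and human ratings for five responses.
    judge = [1.0, 2.0, 3.0, 4.0, 5.0]
    human = [1.0, 3.0, 2.0, 5.0, 4.0]
    print(round(pearson_r(judge, human), 3))
```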
[68] The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla
Main category: cs.CL
TL;DR: Percept-V is a new benchmark dataset of 6000 program-generated images testing basic visual perception skills based on TVPS-4 framework. Surprisingly, state-of-the-art MLLMs perform poorly on these simple perception tasks compared to humans, especially as image complexity increases.
Details
Motivation: While cognitive science treats visual perception as a fundamental sign of intelligence, existing MLLM benchmarks focus on advanced reasoning and knowledge rather than basic perception skills. There's limited research evaluating MLLMs on simple visual perception comparable to human developmental abilities.
Method: Created Percept-V dataset with 6000 program-generated uncontaminated images across 30 domains, each testing one or more TVPS-4 skills (like visual discrimination and form constancy). The tasks are designed to be simple with minimal reasoning/knowledge requirements. Evaluated both proprietary and open-source state-of-the-art MLLMs against human performance.
Result: MLLMs showed surprisingly weak performance compared to high human performance on Percept-V. Performance degrades rapidly as the number of objects in an image increases. The study identified specific perception skills that are particularly challenging for all models.
Conclusion: Despite MLLMs’ ability to solve complex tasks, they struggle with basic visual perception that humans master early in development. This reveals a significant gap in MLLMs’ fundamental perception capabilities and suggests perception should be a focus area for improvement in multimodal AI systems.
Abstract: Cognitive science research treats visual perception, the ability to understand and make sense of a visual input, as one of the early developmental signs of intelligence. Its TVPS-4 framework categorizes human perception into seven skills, such as visual discrimination and form constancy, and tests each of them. Do Multimodal Large Language Models (MLLMs) match up to humans in basic perception? Even though there are many benchmarks that evaluate MLLMs on advanced reasoning and knowledge skills, there is limited research that focuses evaluation on simple perception. In response, we introduce Percept-V, a dataset containing 6000 program-generated uncontaminated images divided into 30 domains, where each domain tests one or more TVPS-4 skills. Our focus is on perception, so we make our domains quite simple, and the reasoning and knowledge required for solving them are minimal. Since modern-day MLLMs can solve much more complex tasks, our a-priori expectation is that they will solve these domains very easily. Contrary to our belief, our experiments show a weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V. We find that as the number of objects in the image increases, performance goes down rather fast. Our experiments also identify the perception skills that are considerably harder for all models.
[69] Is this chart lying to me? Automating the detection of misleading visualizations
Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Main category: cs.CL
TL;DR: Researchers introduce Misviz, a benchmark of 2,604 real-world misleading visualizations annotated with 12 types of misleaders, plus Misviz-synth, a synthetic dataset of 57,665 visualizations for training. They evaluate state-of-the-art MLLMs and other systems, finding the task remains highly challenging.
Details
Motivation: Misleading visualizations are a potent driver of misinformation on social media and the web, violating chart design principles to distort data. Both humans and MLLMs are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying specific design rule violations could help protect readers and reduce misinformation spread, but AI model training and evaluation has been limited by the absence of large, diverse, openly available datasets.
Method: Created Misviz benchmark with 2,604 real-world visualizations annotated with 12 types of misleaders. Also created Misviz-synth, a synthetic dataset of 57,665 visualizations generated using Matplotlib based on real-world data tables for model training. Conducted comprehensive evaluation using state-of-the-art MLLMs, rule-based systems, and image-axis classifiers.
Result: The task of automatically detecting misleading visualizations remains highly challenging for current state-of-the-art models. The comprehensive evaluation revealed significant limitations in existing approaches despite the availability of the new benchmark datasets.
Conclusion: The researchers release Misviz, Misviz-synth, and accompanying code to address the lack of large, diverse datasets for training and evaluating AI models on misleading visualization detection. The benchmark demonstrates the difficulty of this task and provides resources for future research in combating visualization-based misinformation.
Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also create Misviz-synth, a synthetic dataset of 57,665 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and image-axis classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
[70] Collaborate, Deliberate, Evaluate: How LLM Alignment Affects Coordinated Multi-Agent Outcomes
Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy
Main category: cs.CL
TL;DR: LLM alignment methods designed for single-user settings fail in multi-party collaborations; intervention agents that encourage reflection outperform standard alignment approaches in group decision-making tasks.
Details
Motivation: As LLMs become AI collaborators in multi-party workflows, their coordination with humans and other AIs requires predictable behavior in multi-turn interactions. Current alignment methods are developed for simplified single-user settings and don't account for complex multi-party dynamics.
Method: The paper uses the theoretical lens of modified-action MDPs to analyze alignment limitations, then introduces a roleplay simulation methodology where differently-aligned LLMs are deployed in collaborative task dialogues. Intervention agents are designed to insert themselves into group dialogues to encourage reflection rather than provide answers.
Result: Intervention agents robust to action modification significantly outperform common alignment baselines in supporting correct task outcomes. The study shows how different alignment methods affect group collaboration trajectories, belief alignment, and coordination.
Conclusion: Standard LLM alignment techniques developed for single-user settings are inadequate for multi-party collaborations. Effective AI collaborators in group settings require alignment approaches that account for long-horizon multi-party interaction dynamics, with intervention agents that encourage reflection proving particularly effective.
Abstract: As Large Language Models (LLMs) get integrated into diverse workflows, they are increasingly being regarded as “collaborators” with humans, and required to work in coordination with other AI systems. If such AI collaborators are to reliably coordinate their actions and behaviors with humans or other AIs, their properties and behaviors over multi-turn interactions must be known and predictable. This paper examines how different alignment methods affect LLM agents’ effectiveness as partners in multi-turn, multi-party collaborations. We study this question through the lens of intervention agents that insert themselves into group dialogues not to provide answers, but to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. Common alignment techniques are typically developed under simplified single-user settings and assume the optimality of the underlying token MDP. Using the theoretical lens of the modified-action MDP, we show how they do not account for the dynamics of long-horizon multi-party interactions. We present a novel roleplay simulation methodology, where we align LLMs according to different methods and then deploy them in collaborative task dialogues to quantify how interventions affect the trajectory of group collaboration, belief alignment, and coordination. Our results show that an intervention agent that is robust to action modification significantly outperforms common alignment baselines in supporting correct task outcomes.
[71] SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
Manon Berriche, Célia Nouri, Chloé Clavel, Jean-Philippe Cointet
Main category: cs.CL
TL;DR: SPOT introduces the first annotated corpus for detecting “stopping points” in online discussions - subtle interventions that pause or redirect conversations, operationalized as a binary classification task on French Facebook comments.
Details
Motivation: Existing frameworks like counterspeech or social correction often overlook subtle, ordinary critical interventions in online discussions. The authors aim to translate the sociological concept of "stopping points" into a reproducible NLP task to address this gap.
Method: Created SPOT corpus with 43,305 manually annotated French Facebook comments linked to false information URLs. Developed reliable annotation guidelines, benchmarked fine-tuned CamemBERT encoder models and instruction-tuned LLMs with various prompting strategies, and incorporated contextual metadata.
Result: Fine-tuned encoders outperformed prompted LLMs by more than 10 percentage points in F1 score. Incorporating contextual metadata improved encoder F1 scores from 0.75 to 0.78. The dataset, guidelines, and code are released for transparency.
Conclusion: Supervised learning remains crucial for emerging non-English social media tasks like stopping point detection. The SPOT corpus enables reproducible research on subtle conversational interventions that existing frameworks overlook.
Abstract: We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. Stopping points are ordinary critical interventions that pause or redirect online discussions through a range of forms (irony, subtle doubt or fragmentary arguments) that frameworks like counterspeech or social correction often overlook. We operationalize this concept as a binary classification task and provide reliable annotation guidelines. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users, enriched with contextual metadata (article, post, parent comment, page or group, and source). We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies. Results show that fine-tuned encoders outperform prompted LLMs in F1 score by more than 10 percentage points, confirming the importance of supervised learning for emerging non-English social media tasks. Incorporating contextual metadata further improves encoder models’ F1 scores from 0.75 to 0.78. We release the anonymized dataset, along with the annotation guidelines and code in our code repository, to foster transparency and reproducible research.
[72] Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han
Main category: cs.CL
TL;DR: 4/6 is a modification to block-scaled NVFP4 quantization that reduces quantization error by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform.
Details
Motivation: Low-precision formats like NVFP4 improve speed and reduce memory usage for large language models, but quantization degrades model performance due to precision loss. The non-uniform step sizes in floating point formats create larger quantization error on larger values.
Method: Four Over Six (4/6) modifies block-scaled NVFP4 quantization by adaptively scaling some blocks to smaller FP4 values. This makes the distribution of representable values more uniform and reduces quantization error for near-maximal values. The method can be efficiently implemented on NVIDIA Blackwell GPUs.
Result: 4/6 brings training loss closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. It achieves performance gains during both pre-training and inference with minimal computational overhead, as demonstrated with the Nemotron 3 Nano 30B-A3B model architecture.
Conclusion: 4/6 effectively addresses the quantization error problem in NVFP4 by making the distribution of representable values more uniform through adaptive scaling, enabling better model performance while maintaining the efficiency benefits of low-precision formats.
Abstract: As large language models have grown larger, interest has grown in low-precision numerical formats such as NVFP4 as a way to improve speed and reduce memory usage. However, quantizing models to NVFP4 remains difficult as the lack of precision generally degrades model performance. In this work, we address this issue with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that yields reduced quantization error. Unlike integer formats, floating point formats have non-uniform step sizes which create larger quantization error on larger values. 4/6 takes advantage of this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, resulting in performance gains during both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. Our code is available at http://github.com/mit-han-lab/fouroversix.
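The block-level idea can be illustrated with a small sketch. The FP4 (E2M1) magnitude grid below is the standard one; everything else is an assumption made for illustration: the abstract does not spell out the selection rule, so this sketch simply tries mapping each block's largest magnitude to 6 and to 4 and keeps whichever gives the lower mean squared error. The block scale is kept in full precision rather than FP8, and the function names are hypothetical, not taken from the authors' repository.

```python
# Representable magnitudes of FP4 (E2M1). Note the coarse steps near the top (3, 4, 6).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max,
    round every value to the nearest FP4 magnitude, then dequantize."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max  # simplification: scale stays in full precision
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def quantize_block_adaptive(block):
    """Assumed selection rule: try target maxima 6 and 4, keep the lower-error one.
    Ties go to 6, the conventional choice."""
    candidates = {t: quantize_block(block, t) for t in (6.0, 4.0)}
    best = min(candidates, key=lambda t: mse(block, candidates[t]))
    return best, candidates[best]


if __name__ == "__main__":
    # A block with a near-maximal value (5.0) that falls in the wide 4-to-6 gap.
    target, dequantized = quantize_block_adaptive([6.0, 5.0, 1.0, 0.0])
    print(target, dequantized)
```

In this toy block the value 5.0 sits in the gap between 4 and 6, so scaling the block maximum to 4 places it closer to a representable value, which is the kind of near-maximal error the abstract describes.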
[73] Do You Feel Comfortable? Detecting Hidden Conversational Escalation in AI Chatbots
Jihyung Park, Saleh Afroogh, David Atkinson, Junfeng Jiao
Main category: cs.CL
TL;DR: GAUGE is a logit-based framework for real-time detection of hidden conversational escalation in LLMs, addressing implicit harm from emotional reinforcement that traditional toxicity filters miss.
Details
Motivation: LLMs are increasingly used as emotional companions, but repeated emotional reinforcement can cause implicit harm through affective drift. Traditional toxicity filters fail to detect this subtle escalation, and existing guardrails using external classifiers or clinical rubrics lag behind real-time conversational dynamics.
Method: GAUGE (Guarding Affective Utterance Generation Escalation) is a logit-based framework that measures how an LLM’s output probabilistically shifts the affective state of a dialogue in real-time.
Result: The paper proposes a novel framework for detecting hidden conversational escalation that traditional methods miss, enabling real-time monitoring of affective state shifts in LLM conversations.
Conclusion: GAUGE addresses a critical gap in LLM safety by providing real-time detection of implicit emotional harm through probabilistic measurement of affective state shifts, offering a more nuanced approach than traditional toxicity filters.
Abstract: Large Language Models (LLMs) are increasingly integrated into everyday interactions, serving not only as information assistants but also as emotional companions. Even in the absence of explicit toxicity, repeated emotional reinforcement or affective drift can gradually escalate distress in a form of implicit harm that traditional toxicity filters fail to detect. Existing guardrail mechanisms often rely on external classifiers or clinical rubrics that may lag behind the nuanced, real-time dynamics of a developing conversation. To address this gap, we propose GAUGE (Guarding Affective Utterance Generation Escalation), a logit-based framework for the real-time detection of hidden conversational escalation. GAUGE measures how an LLM’s output probabilistically shifts the affective state of a dialogue.
[74] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus
Main category: cs.CL
TL;DR: Refusal Steering is an inference-time method that uses LLM-as-a-judge to assign refusal confidence scores and ridge-regularized steering vectors to control LLM refusal behavior on politically sensitive topics without retraining.
Details
Motivation: To achieve fine-grained control over LLM refusal behavior on politically sensitive topics without requiring retraining, replacing fragile pattern-based refusal detection with more robust methods.
Method: Uses LLM-as-a-judge to assign refusal confidence scores, proposes ridge-regularized variant to compute steering vectors that better isolate refusal-compliance direction, applies activation steering at inference time.
Result: Successfully removes refusal behavior on politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. Works across 4B and 80B models, can also induce targeted refusals when desired. Refusal signals concentrate in deeper transformer layers and are distributed across many dimensions.
Conclusion: Activation steering can effectively remove political refusal behavior while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
Abstract: We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
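The abstract does not give the formula for the ridge-regularized steering vector, so the following is only one plausible reading, stated as an assumption: fit a ridge regression from hidden activations to the judge's refusal confidence scores and use the weight vector as the steering direction, then add a scaled copy of it to a hidden state at inference time. The function names, the toy two-dimensional activations, and the choice of `lam` are hypothetical.

```python
def ridge_direction(acts, scores, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y, where X holds mean-centred activations
    and y the mean-centred refusal scores. Assumed formulation, not the paper's."""
    n, d = len(acts), len(acts[0])
    mean_x = [sum(row[j] for row in acts) / n for j in range(d)]
    mean_y = sum(scores) / n
    X = [[row[j] - mean_x[j] for j in range(d)] for row in acts]
    y = [s - mean_y for s in scores]
    # Augmented system [X^T X + lam * I | X^T y].
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) + (lam if r == c else 0.0)
         for c in range(d)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(d)
    ]
    # Gauss-Jordan elimination with partial pivoting (lam > 0 keeps pivots non-zero).
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(d):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][d] / A[r][r] for r in range(d)]


def steer(hidden, direction, alpha):
    """Activation steering: shift a hidden state along the direction.
    Negative alpha moves away from refusal, positive alpha toward it."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy data: the refusal score depends only on the first activation dimension.
    acts = [[1.0, 0.3], [-1.0, 0.3], [1.0, -0.3], [-1.0, -0.3]]
    scores = [1.0, 0.0, 1.0, 0.0]
    w = ridge_direction(acts, scores, lam=1.0)
    print(w, steer([0.0, 0.0], w, alpha=-2.0))
```

The ridge term shrinks weights on dimensions that carry little signal, which is one way a regularized estimate could isolate a refusal–compliance direction more cleanly than an unregularized fit.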
[75] FLEx: Language Modeling with Few-shot Language Explanations
Adar Avsian, Christopher Richardson, Anirudh Sundar, Larry Heck
Main category: cs.CL
TL;DR: FLEx improves language model performance using few-shot explanations without weight updates, reducing CoT errors by up to 83%.
Details
Motivation: Language models still make repeated mistakes across related queries, and collecting natural language explanations at scale is infeasible, especially in expert domains.
Method: FLEx selects representative model errors via embedding-based clustering, verifies explanations correct those errors, and summarizes them into a prompt prefix for inference-time guidance without weight modification.
Result: FLEx consistently outperforms chain-of-thought prompting across CounterBench, GSM8K, and ReasonIF datasets, reducing up to 83% of CoT’s remaining errors.
Conclusion: FLEx effectively improves model behavior using few explanatory examples, providing a practical solution for error correction without requiring extensive annotation or model retraining.
Abstract: Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. However, they still make mistakes, and these mistakes are often repeated across related queries. Natural language explanations can help correct these errors, but collecting them at scale may be infeasible, particularly in domains where expert annotators are required. To address this issue, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior using a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference-time. This summary guides the model to avoid similar errors on new inputs, without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. We find that FLEx consistently outperforms chain-of-thought (CoT) prompting across all three datasets and reduces up to 83% of CoT’s remaining errors.
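The phrase "reduces up to 83% of CoT's remaining errors" is a relative error reduction, not an absolute accuracy gain. A short sketch of that arithmetic follows; the accuracies in the example are hypothetical and chosen only to show the calculation, they are not figures from the paper.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Fraction of the baseline's errors that the new method removes."""
    baseline_errors = 1.0 - acc_baseline
    if baseline_errors <= 0.0:
        raise ValueError("baseline has no remaining errors to reduce")
    return (acc_new - acc_baseline) / baseline_errors


if __name__ == "__main__":
    # Hypothetical: baseline 70% accurate, new method 85% accurate.
    # The baseline leaves 30 points of error; 15 of them are removed.
    print(relative_error_reduction(0.70, 0.85))
```

Read this way, a large relative reduction can correspond to a small absolute gain when the baseline is already accurate.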
[76] TeleMem: Building Long-Term and Multimodal Memory for Agentic AI
Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, Xuelong Li
Main category: cs.CL
TL;DR: TeleMem is a unified long-term multimodal memory system that improves LLM performance in extended dialogues by using narrative dynamic extraction, structured writing pipelines, and multimodal reasoning with ReAct-style loops.
Details
Motivation: Current LLMs struggle with long-term interactions due to limited attention over extended dialogue histories. RAG helps but has problems: unreliable memory updates, schema-driven hallucinations, inefficient write operations, and poor multimodal reasoning support.
Method: 1) Narrative dynamic extraction to maintain coherent user profiles with dialogue-grounded information only. 2) Structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries. 3) Multimodal memory module with ReAct-style reasoning (observe, think, act) for video understanding in long-term contexts.
Result: TeleMem outperforms state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
Conclusion: TeleMem effectively addresses limitations of current memory systems for LLMs by providing efficient, accurate, and multimodal long-term memory capabilities that improve performance in extended dialogue scenarios.
Abstract: Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
[77] Attention Projection Mixing with Exogenous Anchors
Jonathan Su
Main category: cs.CL
TL;DR: ExoFormer solves the “first-layer tension” in cross-layer attention reuse by learning external anchor projections, improving optimization and data efficiency while preserving token identity.
Details
Motivation: Cross-layer reuse of early attention projections creates a structural conflict where the first layer must serve dual roles: as a stable reusable anchor for deeper layers and as an effective computational block. This "first-layer tension" limits the effectiveness of internal-anchor designs.
Method: ExoFormer learns exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. It uses a unified normalized mixing framework that mixes queries, keys, values, and gate logits with learnable coefficients at different granularities (elementwise/headwise/scalar), with normalization of anchor sources being key to stable reuse.
Result: ExoFormer variants consistently outperform internal-anchor counterparts. The dynamic variant achieves 1.5 downstream accuracy points improvement while matching validation loss using 1.5x fewer tokens than Gated Attention. External anchors preserve essential token identity, allowing layers to specialize exclusively in refinement.
Conclusion: ExoFormer resolves the first-layer tension through external anchor projections, enabling more effective cross-layer reuse. The Offloading Hypothesis explains its efficacy: external anchors preserve token identity while allowing layers to specialize in refinement. The approach improves both optimization and data efficiency.
Abstract: Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We show this “first-layer tension” is a hidden limiter of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise/headwise/scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5 downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in refinement. We release code and models to facilitate future research.
[78] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
Stergios Chatzikyriakidis, Anastasia Natsina
Main category: cs.CL
TL;DR: LLMs struggle with phonological tasks like rhyme in low-resource languages like Greek. A hybrid system combining LLMs with phonological algorithms achieves accurate rhyme identification/generation, with verification loops dramatically improving performance from <4% to 73.1% valid poems.
Details
Motivation: LLMs have remarkable NLP capabilities but struggle with phonologically-grounded phenomena like rhyme detection and generation, especially in lower-resource languages such as Modern Greek. There's a need to address this gap in phonological reasoning.
Method: Hybrid system combining LLMs with deterministic phonological algorithms. Implements comprehensive taxonomy of Greek rhyme types (Pure, Rich, Imperfect, Mosaic, IDV patterns). Uses agentic generation pipeline with phonological verification. Evaluates multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, RAG-augmented) across various LLMs including Claude, GPT-4o, Gemini, and open-weight models.
Result: Significant “Reasoning Gap” observed: native-like models (Claude 3.7) perform intuitively (40% accuracy), while reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) with Chain-of-Thought prompting. Pure LLM generation fails catastrophically (<4% valid poems), but hybrid verification loop restores performance to 73.1%. System and corpus of 40,000+ rhymes released.
Conclusion: Hybrid approach combining LLMs with phonological algorithms is essential for accurate rhyme processing in low-resource languages. Pure LLM approaches fail for phonological tasks, but verification mechanisms can dramatically improve performance. The released system and corpus support future research in phonological NLP.
Abstract: Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant “Reasoning Gap”: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
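A generate-then-verify loop of the kind described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the rhyme check is a crude orthographic stand-in (it compares spelling from the last accented vowel onward, whereas a real phonological check would first map spelling to sounds), `propose` is a hypothetical callable standing in for an LLM call, and all function names are invented for the example.

```python
import unicodedata

# Accented Greek vowels, normalized so the set matches NFC-normalized input.
STRESSED = set(unicodedata.normalize("NFC", "άέήίόύώΐΰ"))


def rhyme_part(word):
    """Return the word from its last accented vowel onward
    (orthographic approximation of the rhyming segment)."""
    w = unicodedata.normalize("NFC", word.lower())
    for i in range(len(w) - 1, -1, -1):
        if w[i] in STRESSED:
            return w[i:]
    return w  # no accent mark (e.g. a monosyllable): compare the whole word


def lines_rhyme(line_a, line_b):
    """Deterministic check: the final words differ but share the same rhyme part."""
    last_a = line_a.split()[-1].strip(".,;·!")
    last_b = line_b.split()[-1].strip(".,;·!")
    return last_a != last_b and rhyme_part(last_a) == rhyme_part(last_b)


def generate_verified(propose, first_line, max_tries=5):
    """Ask the (hypothetical) generator for candidates until one passes verification."""
    for attempt in range(max_tries):
        candidate = propose(first_line, attempt)
        if lines_rhyme(first_line, candidate):
            return candidate
    return None  # no candidate passed within the budget


if __name__ == "__main__":
    canned = ["Ήρθε η νύχτα", "και πάμε πέρα"]  # stand-in for model outputs
    print(generate_verified(lambda line, k: canned[k], "Πέρασε η μέρα"))
```

The design point is that the verifier, not the model, decides what counts as valid, so a model that rarely rhymes on the first attempt can still yield valid output once rejected candidates are resampled.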
[79] From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models
Youmi Ma, Naoaki Okazaki
Main category: cs.CL
TL;DR: RetMask improves long-context LLM performance by masking retrieval heads during training, achieving +2.28 points on HELMET at 128K for Llama-3.1 with substantial gains on citation generation and passage re-ranking.
Details
Motivation: While retrieval heads have been identified in LLMs for context information retrieval, their role in enhancing model performance remains unexplored. The paper investigates whether these mechanistic insights can be leveraged to improve long-context capabilities.
Method: Proposes RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant where retrieval heads are masked. This mechanism-based approach creates a training signal that enhances the model’s ability to utilize retrieval heads effectively.
Result: Achieves substantial improvements: +2.28 points on HELMET at 128K for Llama-3.1, with +70% gains on generation with citation and +32% on passage re-ranking, while preserving performance on general tasks. Effectiveness depends on retrieval head organization - models with concentrated patterns respond strongly while distributed patterns show limited gains.
Conclusion: The mechanistic relationship validates the function of retrieval heads and demonstrates that mechanistic insights from interpretability research can be transformed into practical performance enhancements for LLMs, particularly for long-context tasks.
Abstract: Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improving model performance remains unexplored. This work investigates whether retrieval heads can be leveraged to enhance the long-context capabilities of LLMs. Specifically, we propose RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant in which the retrieval heads are masked. This mechanism-based approach achieves substantial improvements: +2.28 points on HELMET at 128K for Llama-3.1, with +70% gains on generation with citation and +32% on passage re-ranking, while preserving performance on general tasks. Experiments across three model families reveal that the effectiveness depends on retrieval head organization: models with concentrated patterns of retrieval heads respond strongly, while those with distributed patterns show limited gains. This mechanistic relationship validates the function of retrieval heads and demonstrates that mechanistic insights can be transformed into performance enhancements.
[80] Beyond Tokens: Concept-Level Training Objectives for LLMs
Laya Iyer, Pranav Somani, Alice Guo, Dan Jurafsky, Chen Shani
Main category: cs.CL
TL;DR: The paper proposes shifting from token-level prediction to concept-level prediction in LLM training, where concepts group semantically equivalent surface forms, leading to better alignment with human semantic abstractions.
Details
Motivation: The next-token prediction (NTP) objective penalizes valid alternative continuations that are semantically equivalent (e.g., "mom" vs "mother"), biasing models toward surface form rather than underlying meaning. This mismatch between training signal and semantic correctness motivates higher-level learning objectives.
Method: Proposes concept-level prediction where concepts group multiple surface forms of the same idea. Introduces various methods for integrating conceptual supervision into LLM training, moving beyond token-level prediction to operate over higher-level representations.
Result: Concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks.
Conclusion: Concept-level supervision serves as an improved training signal that better aligns LLMs with human semantic abstractions, suggesting a promising direction for future LLM development beyond token-level prediction.
Abstract: The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the token level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent (e.g., “mom” vs. “mother”). As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., “mom,” “mommy,” “mother” → MOTHER). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests concept-level supervision as an improved training signal that better aligns LLMs with human semantic abstractions.
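The difference between the two objectives can be made concrete with a small sketch. The paper mentions several ways of integrating conceptual supervision without detailing them here, so this shows only one simple possibility, assumed for illustration: score the total probability mass the model puts on any surface form of the target concept. The toy distribution and function names are hypothetical.

```python
import math


def token_nll(probs, target_token):
    """Standard next-token loss: only the single reference token counts."""
    return -math.log(probs[target_token])


def concept_nll(probs, concept_forms):
    """Assumed concept-level loss: any surface form of the concept counts,
    so the loss uses the summed probability of all its forms."""
    mass = sum(probs.get(tok, 0.0) for tok in concept_forms)
    return -math.log(mass)


if __name__ == "__main__":
    # Toy next-token distribution; the reference continuation is "mom".
    probs = {"mom": 0.3, "mother": 0.4, "mommy": 0.1, "dog": 0.2}
    MOTHER = {"mom", "mommy", "mother"}
    print(token_nll(probs, "mom"), concept_nll(probs, MOTHER))
```

In the toy case the model puts most of its mass on "mother", which the token-level loss treats as an error against the reference "mom" while the concept-level loss does not.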
[81] Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation
Jinsook Lee, Kirk Vanacore, Zhuqian Zhou, Bakhtawar Ahtisham, Jeanine Grutter, Rene F. Kizilcec
Main category: cs.CL
TL;DR: DA annotation often suffers from boundary disagreements despite action agreement. Codebook-injected segmentation conditions boundaries on annotation criteria, with LLM-based segmenters evaluated against baselines using new metrics. No single segmenter dominates - segmentation should be optimized for downstream objectives.
Details
Motivation: Traditional Dialogue Act annotation treats intent as localized to individual utterances, leading to annotator disagreements on segment boundaries despite agreement on underlying actions. This reduces apparent reliability and highlights the need for better segmentation approaches that align with annotation criteria.
Method: Proposed codebook-injected segmentation that conditions boundary decisions on downstream annotation criteria. Evaluated LLM-based segmenters against standard and retrieval-augmented baselines. Introduced new evaluation metrics for span consistency, distinctiveness, and human-AI distributional agreement without requiring gold labels.
Result: DA-aware segmentation produces more internally consistent segments than text-only baselines. LLMs excel at creating construct-consistent spans, but coherence-based baselines remain superior at detecting global dialogue flow shifts. No single segmenter dominates across two datasets, with trade-offs between within-segment coherence, boundary distinctiveness, and human-AI distributional agreement.
Conclusion: Segmentation is a consequential design choice that should be optimized for specific downstream objectives rather than a single performance score. The trade-offs between different segmentation qualities suggest practitioners should select segmentation approaches based on their particular annotation needs and goals.
Abstract: Dialogue Act (DA) annotation typically treats communicative or pedagogical intent as localized to individual utterances or turns. This leads annotators to agree on the underlying action while disagreeing on segment boundaries, reducing apparent reliability. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria, and evaluate LLM-based segmenters against standard and retrieval-augmented baselines. To assess these without gold labels, we introduce evaluation metrics for span consistency, distinctiveness, and human-AI distributional agreement. We found DA-awareness produces segments that are internally more consistent than text-only baselines. While LLMs excel at creating construct-consistent spans, coherence-based baselines remain superior at detecting global shifts in dialogue flow. Across two datasets, no single segmenter dominates. Improvements in within-segment coherence frequently trade off against boundary distinctiveness and human-AI distributional agreement. These results highlight segmentation as a consequential design choice that should be optimized for downstream objectives rather than a single performance score.
[82] Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty
Sravanthi Machcha, Sushrita Yerra, Sahil Gupta, Aishwarya Sahoo, Sharmin Sultana, Hong Yu, Zonghai Yao
Main category: cs.CL
TL;DR: MedAbstain is a benchmark for evaluating LLMs’ ability to abstain from answering medical multiple-choice questions when uncertain, revealing that even high-accuracy models often fail to abstain appropriately.
Details
Motivation: Current LLM evaluation focuses too much on accuracy, but in safety-critical medical applications, the ability to abstain when uncertain is crucial for trustworthy deployment. There's a need to systematically evaluate abstention capabilities in medical decision-making scenarios.
Method: MedAbstain benchmark integrates conformal prediction, adversarial question perturbations, and explicit abstention options. It evaluates both open- and closed-source LLMs in medical multiple-choice question answering, which generalizes to agentic action selection.
Result: Even state-of-the-art, high-accuracy LLMs often fail to abstain when uncertain. Providing explicit abstention options consistently increases both model uncertainty and safe abstention, far more than input perturbations do. Scaling model size or advanced prompting provides little improvement in abstention capabilities.
Conclusion: Abstention mechanisms are central to trustworthy LLM deployment in high-stakes applications. The findings offer practical guidance for improving safety, highlighting that explicit abstention options are more effective than other approaches for encouraging safer behavior.
Abstract: Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce MedAbstain, a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA) – a discrete-choice setting that generalizes to agentic action selection – integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain when uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and safer abstention, far more than input perturbations, while scaling model size or advanced prompting brings little improvement. These findings highlight the central role of abstention mechanisms for trustworthy LLM deployment and offer practical guidance for improving safety in high-stakes applications.
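How conformal prediction can drive abstention in multiple-choice QA can be sketched with standard split conformal prediction. The abstention rule below (answer only when the prediction set contains exactly one option) is an assumption for illustration, as are the function names and toy probabilities; the benchmark's actual protocol may differ.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration with nonconformity score 1 - p(true option).
    Returns the score at rank ceil((n + 1) * (1 - alpha)) among calibration scores."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if k > n:
        return float("inf")  # too few calibration points: every option is kept
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All options whose nonconformity score is within the calibrated threshold."""
    return [opt for opt, p in probs.items() if 1.0 - p <= qhat]


def answer_or_abstain(probs, qhat):
    """Assumed rule: commit only when exactly one option survives."""
    options = prediction_set(probs, qhat)
    return options[0] if len(options) == 1 else "ABSTAIN"


if __name__ == "__main__":
    # Toy calibration set: probability assigned to the correct option "A".
    true_probs = [0.9, 0.8, 0.7, 0.5, 0.95, 0.4, 0.3]
    cal_probs = [{"A": p, "B": 1.0 - p} for p in true_probs]
    qhat = conformal_threshold(cal_probs, ["A"] * len(true_probs), alpha=0.25)
    print(answer_or_abstain({"A": 0.9, "B": 0.05, "C": 0.05}, qhat))
    print(answer_or_abstain({"A": 0.45, "B": 0.45, "C": 0.10}, qhat))
```

An empty set or a set with several options both signal that the model's probabilities do not single out one answer at the chosen coverage level, which is why this sketch maps both cases to abstention.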
[83] Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains
Yuan Gao, Zhigang Liu, Xinyu Yao, Bo Chen, Xiaobing Zhao
Main category: cs.CL
TL;DR: Proposes VC-LLM, a value-consistent LLM for sensitive domains using adversarial alignment framework with attacker-actor-critic training.
Details
Motivation: Address bias and value inconsistency problems in LLMs, especially in sensitive domains like race, society, and politics.
Method: Adversarial alignment framework with continued pre-training, instruction fine-tuning, and adversarial training using Attacker (generates controversial queries), Actor (generates value-consistent responses), and Critic (filters for quality).
Result: VC-LLM outperforms existing mainstream models in both Chinese and English tests on bilingual evaluation dataset.
Conclusion: The adversarial alignment framework effectively enhances value consistency in sensitive domains, as demonstrated by VC-LLM’s superior performance.
Abstract: With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society and politics. In this paper, we propose an adversarial alignment framework, which enhances the value consistency of the model in sensitive domains through continued pre-training, instruction fine-tuning and adversarial training. In adversarial training, we use the Attacker to generate controversial queries, the Actor to generate responses with value consistency, and the Critic to filter and ensure response quality. Furthermore, we train a Value-Consistent Large Language Model, VC-LLM, for sensitive domains, and construct a bilingual evaluation dataset in Chinese and English. The experimental results show that VC-LLM performs better than the existing mainstream models in both Chinese and English tests, verifying the effectiveness of the method. Warning: This paper contains examples of LLMs that are offensive or harmful in nature.
[84] Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
Main category: cs.CL
TL;DR: A practical survey on Mechanistic Interpretability (MI) that moves beyond observational analysis to provide a systematic “Locate, Steer, and Improve” framework for actionable intervention in LLMs.
Details
Motivation: Existing MI reviews treat it as observational science, summarizing insights but lacking systematic frameworks for actionable intervention. The authors aim to bridge this gap by creating a practical framework that enables tangible model improvements.
Method: Proposes a structured pipeline: “Locate, Steer, and Improve.” Categorizes Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish rigorous intervention protocols.
Result: Demonstrates how the framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. A curated paper list is provided in a GitHub repository.
Conclusion: The survey transforms MI from passive observation to active intervention methodology, providing a systematic framework for practical model optimization through localization and steering techniques.
Abstract: Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: “Locate, Steer, and Improve.” We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.
[85] Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
Main category: cs.CL
TL;DR: RoT (Render-of-Thought) is a framework that converts textual reasoning chains into images for token compression and faster inference while maintaining reasoning performance.
Details
Motivation: CoT prompting has computational overhead due to verbosity, lacks supervision on intermediate reasoning, and obscures analyzability of latent reasoning chains.
Method: RoT reifies reasoning chains by rendering textual steps into images, using vision encoders of VLMs as semantic anchors to align vision embeddings with textual space for plug-and-play implementation.
Result: Achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT while maintaining competitive performance on mathematical and logical reasoning benchmarks.
Conclusion: RoT validates the feasibility of rendering reasoning chains into images as an efficient alternative to verbose textual CoT, offering analyzable and compressed reasoning representation.
Abstract: Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
[86] The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations
Pierre-Antoine Lequeu, Léo Labat, Laurène Cave, Gaël Lejeune, François Yvon, Benjamin Piwowarski
Main category: cs.CL
TL;DR: A framework for standardizing citizen consultation data using small LLMs to create structured argumentative units for political analysis.
Details
Motivation: To address ethical concerns about using large LLMs for analyzing democratic consultation data, and to develop standardized resources for making citizen contributions more usable for topic modeling and political analysis.
Method: Introduces Corpus Clarification framework that transforms noisy consultation data into structured argumentative units. Creates GDN-CC dataset (1,231 contributions, 2,285 units) with manual annotations, then uses finetuned small language models to reproduce annotations and test on opinion clustering.
Result: Finetuned small language models match or outperform larger LLMs on annotation tasks. Releases GDN-CC-large, an automatically annotated corpus of 240k contributions - the largest annotated democratic consultation dataset to date.
Conclusion: Small, open-weights LLMs can effectively standardize citizen consultation data for political analysis, offering a transparent, locally-runnable alternative to large proprietary models while maintaining or improving performance.
Abstract: LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised for their usage as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how well this standardization can reliably be performed by small, open-weights LLMs, i.e. models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually-curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that finetuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.
[87] LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering
Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Shaoru Guo, Xiaoli Li, Ru Li, Jeff Z. Pan
Main category: cs.CL
TL;DR: LogicScore is a new evaluation framework for Attributed Question Answering that addresses attribution myopia by assessing global logical coherence rather than just isolated factual attributions.
Details
Motivation: Current AQA evaluation methods suffer from "attribution myopia" - they focus too much on verifying isolated statements and their attributions while ignoring the global logical integrity of long-form answers. LLMs often produce factually grounded but logically incoherent responses with deductive gaps.
Method: LogicScore uses Horn Rules and integrates a backward verification mechanism to systematically evaluate three reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment).
Result: Experiments across three multi-hop QA datasets (HotpotQA, MusiQue, 2WikiMultiHopQA) and over 20 LLMs reveal a critical capability gap: leading models achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro).
Conclusion: LogicScore establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. The framework reveals significant gaps in current models’ reasoning capabilities despite strong factual attribution performance.
Abstract: Current evaluation methods for Attributed Question Answering (AQA) suffer from attribution myopia: they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Codes are available at: https://github.com/zhichaoyan11/LogicScore.
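Illustrative sketch (not the authors' implementation): the idea of backward verification over Horn rules can be shown with a small backward-chaining prover that starts from the answer and works back to the cited facts. The rule encoding, the function name proof_support, and the example facts are all hypothetical, and the "redundant" set below is only a toy stand-in for non-redundancy, not LogicScore's Conciseness metric.

```python
from typing import FrozenSet, Optional, Sequence, Set, Tuple

# A Horn rule is (head, body): all atoms in body together imply head.
Rule = Tuple[str, Tuple[str, ...]]


def proof_support(
    goal: str,
    facts: Set[str],
    rules: Sequence[Rule],
    visiting: FrozenSet[str] = frozenset(),
) -> Optional[Set[str]]:
    """Backward-chain from goal; return the facts used by the first proof found, or None."""
    if goal in facts:
        return {goal}
    if goal in visiting:  # guard against cyclic rules
        return None
    for head, body in rules:
        if head != goal:
            continue
        used: Set[str] = set()
        for atom in body:
            sub = proof_support(atom, facts, rules, visiting | {goal})
            if sub is None:
                break  # this rule fails, try the next one
            used |= sub
        else:
            return used  # every body atom was proven
    return None


# Hypothetical two-hop example: the cited facts include one irrelevant statement.
facts = {"directed(Film, Alice)", "born_in(Alice, Oslo)", "released(Film, 1999)"}
rules = [
    ("birthplace_of_director(Film, Oslo)",
     ("directed(Film, Alice)", "born_in(Alice, Oslo)")),
]
support = proof_support("birthplace_of_director(Film, Oslo)", facts, rules)
redundant = facts - support if support is not None else set()
```

Here the answer is derivable (a completeness-style check passes), while the release-year fact is never used in the proof, which is the kind of redundancy a conciseness-style check would flag.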
cs.CV
[88] AI-Based Culvert-Sewer Inspection
Christina Thrainer
Main category: cs.CV
TL;DR: This thesis proposes three methods to improve automated defect segmentation in culverts and sewer pipes under data scarcity: data preprocessing techniques, a novel FORTRESS architecture, and few-shot learning approaches.
Details
Motivation: Culverts and sewer pipes are critical infrastructure whose failure poses serious risks. Data collection and annotation for defect detection is cumbersome and requires domain expertise, making large datasets infeasible. There's a need for methods that work with limited annotated data for real-world applicability.
Method: Three approaches: 1) Preprocessing strategies including traditional data augmentation and dynamic label injection; 2) FORTRESS architecture combining depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms; 3) Few-shot semantic segmentation using bidirectional prototypical networks with attention mechanisms.
Result: Preprocessing techniques significantly improved segmentation performance (IoU and F1 scores). FORTRESS achieved state-of-the-art performance on the culvert sewer pipe defect dataset while reducing trainable parameters and computational cost. The few-shot learning approach achieved satisfactory results across evaluation metrics with limited data.
Conclusion: The thesis successfully addresses data scarcity in culvert and sewer pipe defect segmentation through three complementary approaches: data enhancement, efficient architecture design, and few-shot learning, demonstrating practical applicability to real-world scenarios with limited annotated data.
Abstract: Culverts and sewer pipes are critical components of drainage systems, and their failure can lead to serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge. Having a large dataset for structural defect detection is therefore not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real-world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity. This can be addressed either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms. FORTRESS achieves state-of-the-art performance on the culvert sewer pipe defect dataset, while significantly reducing the number of trainable parameters, as well as its computational cost. Finally, we investigate few-shot semantic segmentation and its applicability to defect detection. Few-shot learning aims to train models with only limited data available. By employing a bidirectional prototypical network with attention mechanisms, the model obtains richer feature representations and achieves satisfactory results across evaluation metrics.
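Illustrative sketch: the two segmentation metrics named above, Intersection over Union (IoU) and F1 score, can be computed for a single defect class from flattened binary masks as follows. This uses the standard definitions only; the function name iou_and_f1 and the toy masks are hypothetical, and the thesis may aggregate over classes or images differently.

```python
from typing import Sequence, Tuple


def iou_and_f1(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Compute IoU and F1 for one class from flattened binary masks (1 = defect pixel)."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)      # predicted defect, is defect
    fp = sum(1 for p, t in zip(pred, target) if p and not t)  # predicted defect, is background
    fn = sum(1 for p, t in zip(pred, target) if not p and t)  # missed defect pixel
    union = tp + fp + fn
    if union == 0:
        # Neither mask contains the class; treating this as a perfect match is a convention.
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Toy 4-pixel example: one hit, one false alarm, one miss, one true background pixel.
iou, f1 = iou_and_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

For this toy input IoU is 1/3 and F1 is 0.5, which shows why F1 always reads at least as high as IoU on the same prediction.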
[89] Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition
Hatef Otroshi Shahreza, Anjith George, Sébastien Marcel
Main category: cs.CV
TL;DR: MLLMs perform poorly for heterogeneous face recognition compared to classical systems, especially in challenging cross-spectral conditions.
Details
Motivation: To evaluate the potential of Multimodal Large Language Models (MLLMs) for biometric applications, specifically heterogeneous face recognition where enrollment and probe images come from different sensing modalities.
Method: Systematic evaluation of state-of-the-art MLLMs across multiple cross-modality scenarios (VIS-NIR, VIS-SWIR, VIS-THERMAL) using biometric protocols and metrics including Acquire Rate, Equal Error Rate, and True Accept Rate.
Result: Substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, despite recent advances in MLLMs.
Conclusion: Current MLLMs have limitations for heterogeneous face recognition, highlighting the importance of rigorous biometric evaluation before deployment in face recognition systems.
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images are from different sensing modalities, including visual (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal camera. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and based on different metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, in spite of recent advances in MLLMs. Our findings highlight the limitations of current MLLMs for HFR and also the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
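Illustrative sketch: the Equal Error Rate (EER) mentioned above is the operating point where the false accept rate and the false reject rate coincide. The snippet below approximates it by sweeping the observed scores as thresholds; the function name, the accept rule (score >= threshold), and the toy scores are assumptions, not the paper's evaluation code.

```python
from typing import Sequence, Tuple


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> Tuple[float, float]:
    """Approximate the EER and its threshold from similarity scores (higher = more similar)."""
    best_gap, best_eer, best_thr = None, 0.0, 0.0
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)  # impostors wrongly accepted
        frr = sum(s < thr for s in genuine) / len(genuine)     # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


# Toy scores: one genuine pair scores low and one impostor pair scores high.
eer, threshold = equal_error_rate(
    genuine=[0.9, 0.8, 0.7, 0.4],
    impostor=[0.6, 0.3, 0.2, 0.1],
)
```

With these scores the two error rates meet at a threshold of 0.6, giving an EER of 0.25. With few samples the rates may never match exactly, which is why the sketch reports the midpoint at the closest threshold.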
[90] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Main category: cs.CV
TL;DR: CURE is an error-aware curriculum learning framework that improves visual grounding and factual consistency in medical vision-language models for radiology report generation without needing additional data.
Details
Motivation: Existing medical vision-language models struggle with accurate visual grounding and factual consistency, often misaligning textual findings with visual evidence, leading to unreliable or weakly grounded predictions in radiology report generation.
Method: CURE fine-tunes a multimodal instructional model using curriculum learning on three tasks: phrase grounding, grounded report generation, and anatomy-grounded report generation. It dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment.
Result: CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%, enhancing both grounding accuracy and report reliability.
Conclusion: CURE is a data-efficient framework that enhances visual grounding and factual consistency in medical vision-language models for radiology report generation without requiring additional training data, improving both spatial alignment and report reliability.
Abstract: Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
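Illustrative sketch: one simple way to "emphasize harder samples" is to draw training examples with probability proportional to their current error. The abstract does not specify CURE's actual weighting rule, so everything below (the 1 - score weighting, the floor parameter, and the function names) is a hypothetical stand-in rather than the authors' method.

```python
import random
from typing import List, Sequence


def sampling_weights(scores: Sequence[float], floor: float = 0.05) -> List[float]:
    """Turn per-sample scores in [0, 1] (higher = easier) into sampling probabilities.

    The floor keeps already-mastered samples from disappearing from training entirely.
    """
    raw = [max(1.0 - s, floor) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


def draw_batch(sample_ids: Sequence[str], scores: Sequence[float],
               batch_size: int, rng: random.Random) -> List[str]:
    """Sample a batch with replacement, favoring samples the model currently gets wrong."""
    return rng.choices(list(sample_ids), weights=sampling_weights(scores), k=batch_size)


# Hypothetical per-sample scores (for example, IoU on a grounding task) after an evaluation pass.
ids = ["easy", "medium", "hard"]
scores = [0.9, 0.5, 0.1]
weights = sampling_weights(scores)
batch = draw_batch(ids, scores, batch_size=8, rng=random.Random(0))
```

Recomputing the scores periodically and re-drawing batches would make the curriculum track the model's current weaknesses, which is the general behavior the summary describes.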
[91] DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction
Cuong Tran Van, Trong-Thang Pham, Ngoc-Son Nguyen, Duy Minh Ho Nguyen, Ngan Le
Main category: cs.CV
TL;DR: DuFal is a dual-frequency-aware learning framework for sparse-view CBCT reconstruction that combines frequency-domain and spatial-domain processing to better recover high-frequency anatomical details that conventional CNNs struggle with.
Details
Motivation: Sparse-view CBCT reconstruction from limited X-ray projections is challenging because conventional CNN-based methods are biased toward learning low-frequency information and struggle to recover fine-grained anatomical details (high-frequency components).
Method: DuFal uses a dual-path architecture with: 1) High-Local Factorized Fourier Neural Operator with global and local high-frequency enhanced branches, 2) Spectral-Channel Factorization to reduce parameters, 3) Cross-Attention Frequency Fusion module to integrate spatial and frequency features, and 4) Intensity Field Decoding pipeline for final CT volume reconstruction.
Result: Experimental results on LUNA16 and ToothFairy datasets show DuFal significantly outperforms state-of-the-art methods in preserving high-frequency anatomical features, especially under extremely sparse-view settings.
Conclusion: DuFal effectively addresses the high-frequency recovery problem in sparse-view CBCT reconstruction through its dual-frequency-aware learning approach, demonstrating superior performance in preserving fine anatomical details.
Abstract: Sparse-view Cone-Beam Computed Tomography reconstruction from limited X-ray projections remains a challenging problem in medical imaging due to the inherent undersampling of fine-grained anatomical details, which correspond to high-frequency components. Conventional CNN-based methods often struggle to recover these fine structures, as they are typically biased toward learning low-frequency information. To address this challenge, this paper presents DuFal (Dual-Frequency-Aware Learning), a novel framework that integrates frequency-domain and spatial-domain processing via a dual-path architecture. The core innovation lies in our High-Local Factorized Fourier Neural Operator, which comprises two complementary branches: a Global High-Frequency Enhanced Fourier Neural Operator that captures global frequency patterns and a Local High-Frequency Enhanced Fourier Neural Operator that processes spatially partitioned patches to preserve spatial locality that might be lost in global frequency analysis. To improve efficiency, we design a Spectral-Channel Factorization scheme that reduces the Fourier Neural Operator parameter count. We also design a Cross-Attention Frequency Fusion module to integrate spatial and frequency features effectively. The fused features are then decoded through a Feature Decoder to produce projection representations, which are subsequently processed through an Intensity Field Decoding pipeline to reconstruct a final Computed Tomography volume. Experimental results on the LUNA16 and ToothFairy datasets demonstrate that DuFal significantly outperforms existing state-of-the-art methods in preserving high-frequency anatomical features, particularly under extremely sparse-view settings.
[92] DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection
Morteza Poudineh, Marc Lalonde
Main category: cs.CV
TL;DR: A deviation-guided prompt learning framework for few-normal shot anomaly detection that combines CLIP’s vision-language capabilities with statistical deviation scoring for better patch-level anomaly localization.
Details
Motivation: Existing few-normal shot anomaly detection methods using vision-language models like CLIP have weak discriminability between normal/abnormal prompts and lack principled scoring mechanisms for patch-level anomalies.
Method: Proposes a deviation-guided prompt learning framework with learnable context vectors (shared across prompts) and anomaly-specific suffix tokens. Uses deviation loss with Top-K Multiple Instance Learning to model patch-level features as Gaussian deviations from normal distribution.
Result: Superior pixel-level detection performance on MVTecAD and VISA benchmarks compared to PromptAD and other baselines. Ablation studies validate effectiveness of learnable prompts, deviation-based scoring, and Top-K MIL strategy.
Conclusion: The framework successfully integrates semantic power of vision-language models with statistical reliability of deviation-based scoring, improving anomaly localization and interpretability in few-normal shot settings.
Abstract: Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
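Illustrative sketch: the combination of Top-K Multiple Instance Learning and deviation-based scoring can be pictured as (1) averaging the K highest patch scores into one image-level score, (2) expressing that score as a z-score against reference scores for normal data, and (3) penalizing normal images for any deviation while pushing anomalous ones past a margin. The loss form below follows the general shape of margin-based deviation losses and is an assumption; the paper's exact formulation, the margin value, and all function names here are hypothetical.

```python
import statistics
from typing import Sequence


def topk_mean(patch_scores: Sequence[float], k: int) -> float:
    """Top-K MIL pooling: average the k highest patch-level anomaly scores."""
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / len(top)


def deviation(score: float, reference_scores: Sequence[float]) -> float:
    """Z-score of `score` relative to reference scores assumed to describe normal data."""
    mu = statistics.fmean(reference_scores)
    sigma = statistics.pstdev(reference_scores)
    return (score - mu) / (sigma + 1e-8)  # epsilon avoids division by zero


def deviation_loss(dev: float, is_anomalous: bool, margin: float = 5.0) -> float:
    """Pull normal samples toward zero deviation; push anomalies beyond `margin`."""
    if is_anomalous:
        return max(0.0, margin - dev)
    return abs(dev)


# Toy image with four patches; two patches look suspicious.
image_score = topk_mean([0.1, 0.9, 0.5, 0.7], k=2)
dev = deviation(image_score, reference_scores=[0.1, 0.2, 0.3])
loss_if_normal = deviation_loss(dev, is_anomalous=False)
loss_if_anomalous = deviation_loss(dev, is_anomalous=True)
```

Pooling only the top-K patches keeps a small defect from being averaged away by the many normal patches around it, which is the usual motivation for this kind of MIL aggregation.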
[93] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
Yunshan Qi, Lin Zhu, Nan Bao, Yifan Zhao, Jia Li
Main category: cs.CV
TL;DR: A NeRF framework that uses sensor-physics modeling to achieve sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding event data.
Details
Motivation: Existing methods for novel view synthesis from blurry LDR images with event data ignore sensor-physics mismatches between camera output and physical world radiance, leading to suboptimal HDR and deblurring results.
Method: Proposes a unified sensor-physics grounded NeRF framework with two mapping fields: 1) pixel-wise RGB mapping field to align rendered HDR values with sensor-recorded LDR values, and 2) event mapping field to bridge physical scene dynamics with actual event sensor output. Both fields are jointly optimized with the NeRF network.
Result: Achieves state-of-the-art deblurring HDR novel view synthesis results on collected and public datasets using single-exposure blurry LDR images and corresponding events.
Conclusion: The sensor-physics grounded approach effectively addresses the gap between camera sensor output and physical world radiance, enabling superior HDR novel view synthesis from challenging blurry LDR inputs with event data.
Abstract: Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.
[94] Hybrid Vision Transformer_GAN Attribute Neutralizer for Mitigating Bias in Chest X_Ray Diagnosis
Jobeal Solomon, Ali Mohammed Mansoor Alsahag, Seyed Sahand Mohammadi Ziabari
Main category: cs.CV
TL;DR: Vision Transformer backbone in attribute-neutral framework reduces demographic bias in chest X-ray classifiers better than convolutional U-Net, cutting sex-recognition AUC by ~10 percentage points while maintaining diagnostic accuracy.
Details
Motivation: Chest X-ray classifiers often exhibit bias from sex- and age-related shortcuts, causing systematic underdiagnosis of minority subgroups. Existing pixel-space attribute neutralizers using convolutional encoders don't fully eliminate attribute leakage at clinically usable edit strengths.
Method: Replaced U-Net convolutional encoder with Vision Transformer backbone in Attribute-Neutral Framework. Trained a data-efficient Image Transformer Small (DeiT-S) neutralizer on ChestX-ray14 dataset. Generated edited images across 11 edit-intensity levels and evaluated with independent AI judge for attribute leakage and CNN for disease prediction.
Result: At a moderate edit level (alpha=0.5), the ViT neutralizer reduces patient sex-recognition AUC to ~0.80 (about 10 percentage points below the original convolutional U-Net encoder), despite being trained for only half as many epochs. Diagnostic performance is maintained: macro ROC AUC across 15 findings stays within 5 percentage points of the unedited baseline, and worst-case subgroup AUC remains near 0.70.
Conclusion: Global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, offering a practical route toward fairer chest X-ray AI by reducing demographic bias while preserving diagnostic accuracy.
Abstract: Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework’s convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70. These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.
[95] Controllable Layered Image Generation for Real-World Editing
Jinrui Yang, Qing Liu, Yijun Li, Mengwei Ren, Letian Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou
Main category: cs.CV
TL;DR: LASAGNA is a unified framework that generates images with layered representations (background + transparent foreground with visual effects) for controllable editing, using a new dataset and benchmark.
Details
Motivation: Existing image generation models struggle with controllable editing of specific elements, and layered representations often lack coherent compositing relationships and realistic visual effects like shadows and reflections.
Method: Proposes LASAGNA framework that jointly generates images with composing layers from multiple conditioning inputs (text prompts, foreground, background, location masks). Introduces LASAGNA-48K dataset of clean backgrounds and RGBA foregrounds with physically grounded visual effects, and LASAGNABENCH benchmark for layer editing.
Result: LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects.
Conclusion: LASAGNA offers greater controllability for real-world applications and will release LASAGNA-48K dataset and LASAGNABENCH benchmark to foster open research in layered image generation and editing.
Abstract: Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers–a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs–text prompts, foreground, background, and location masks–offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community. The project page is https://rayjryang.github.io/LASAGNA-Page/.
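The layered output described above (an opaque background plus an RGBA foreground) can be recombined with the standard alpha "over" operator. The sketch below only illustrates that compositing step and is not LASAGNA's code; the function name `composite_over`, the nested-list image layout, and the straight (non-premultiplied) alpha convention are assumptions made for illustration.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]


def composite_over(foreground: List[List[RGBA]],
                   background: List[List[RGB]]) -> List[List[RGB]]:
    """Blend a straight-alpha RGBA foreground over an opaque RGB background.

    All channel values are assumed to lie in [0, 1].
    """
    if len(foreground) != len(background):
        raise ValueError("layers must have the same height")
    image = []
    for fg_row, bg_row in zip(foreground, background):
        if len(fg_row) != len(bg_row):
            raise ValueError("layers must have the same width")
        row = []
        for (r, g, b, a), (br, bg, bb) in zip(fg_row, bg_row):
            # "Over" operator: the foreground covers a fraction `a` of the pixel.
            row.append((a * r + (1 - a) * br,
                        a * g + (1 - a) * bg,
                        a * b + (1 - a) * bb))
        image.append(row)
    return image
```

Under this convention a soft shadow can be stored in the foreground layer as dark pixels with fractional alpha, so moving or replacing the foreground carries its visual effects along with it.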
[96] DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views
William Huang, Siyou Pei, Leyi Zou, Eric J. Gonzalez, Ishan Chatterjee, Yang Zhang
Main category: cs.CV
TL;DR: Novel hand pose estimation method using dorsal skin deformation features to overcome finger occlusion challenges in XR devices, achieving 18% MPJAE reduction in heavily occluded scenarios.
Details
Motivation: Egocentric hand pose estimation in XR devices faces significant challenges due to frequent finger occlusions, which degrade performance and reliability for interactive applications.
Method: Dual-stream delta encoder that learns pose by contrasting features from dynamic hand with baseline relaxed position, using only cropped dorsal images and leveraging recent dense visual featurizers.
Result: Reduces Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >=50% occluded) compared to state-of-the-art methods, while using smaller model size.
Conclusion: Enhances reliability of downstream tasks like index finger pinch/tap estimation in occluded scenarios and enables new interaction paradigms like detecting isometric force for surface “click” without visible movement.
Abstract: The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >=50% occluded) compared to state-of-the-art techniques that depend on the whole hand’s geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface “click” without visible movement while minimizing model size.
[97] VIOLA: Towards Video In-Context Learning with Minimal Annotations
Ryo Fujii, Hideo Saito, Ryo Hachiuma
Main category: cs.CV
TL;DR: VIOLA: A label-efficient framework for adapting multimodal LLMs to novel video domains using minimal expert annotations and abundant unlabeled data through density-uncertainty sampling and confidence-aware mechanisms.
Details
Motivation: Adapting MLLMs to novel video domains is essential for real-world deployment but challenging due to scarce labeled data. Standard ICL methods require large annotated pools which are impractical in specialized settings like industrial or surgical environments where expert annotations are expensive and limited.
Method: VIOLA combines minimal expert supervision with abundant unlabeled data using: 1) Density-uncertainty-weighted sampling to maximize annotation efficiency by selecting diverse, representative, and informative samples; 2) Hybrid pool construction with confidence-aware retrieval and prompting that explicitly models label reliability, enabling MLLMs to distinguish between verified ground truths and noisy pseudo-labels.
Result: Extensive experiments across nine diverse benchmarks using four MLLMs show that VIOLA significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
Conclusion: VIOLA provides an effective framework for label-efficient adaptation of MLLMs to novel video domains, bridging the gap between the need for expert annotations and practical deployment constraints in specialized environments.
Abstract: Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings because they require expert annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
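The confidence-aware retrieval step can be illustrated with a small sketch. The abstract states only that demonstrations are retrieved by a composite score of similarity and confidence; the linear mix with weight `lam`, the `Demo` record, the convention that expert-verified labels carry confidence 1.0, and the prompt tags in `format_demo` are all assumptions made here for illustration, not VIOLA's actual formulation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demo:
    embedding: Sequence[float]
    label: str
    confidence: float  # assumed: 1.0 for expert-verified labels, < 1.0 for pseudo-labels


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query: Sequence[float], pool: List[Demo],
             k: int = 2, lam: float = 0.7) -> List[Demo]:
    """Return the k demos with the highest mixed similarity/confidence score."""
    def score(demo: Demo) -> float:
        # Hypothetical composite: lam weights similarity, (1 - lam) weights confidence.
        return lam * cosine(query, demo.embedding) + (1 - lam) * demo.confidence
    return sorted(pool, key=score, reverse=True)[:k]


def format_demo(demo: Demo) -> str:
    """Expose label reliability in the prompt so the model can weigh each demo."""
    if demo.confidence >= 1.0:
        return f"[verified] {demo.label}"
    return f"[pseudo-label, confidence {demo.confidence:.2f}] {demo.label}"
```

With two equally similar demonstrations, the verified one ranks first, and a dissimilar but verified demonstration still ranks below a similar pseudo-labeled one when `lam` favors similarity.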
[98] Relative Classification Accuracy: A Calibrated Metric for Identity Consistency in Fine-Grained K-pop Face Generation
Sylvey Lin, Eranki Vasistha
Main category: cs.CV
TL;DR: DDPMs generate high-quality K-pop idol faces but struggle with semantic control and identity consistency, especially for visually similar identities. The paper proposes RCA metric to measure this trade-off.
Details
Motivation: Standard metrics like FID and IS fail to detect identity misalignment in fine-grained, single-domain generation tasks like K-pop idol faces where inter-class similarity is high. There's a need for better evaluation of semantic controllability in conditional generative models.
Method: Use Class-Conditional DDPMs for K-pop idol face generation at 32x32 resolution. Propose Relative Classification Accuracy (RCA) metric that normalizes generative performance against an oracle classifier’s baseline to measure identity consistency.
Result: Model achieves high visual quality (FID 8.93) but suffers from severe semantic mode collapse (RCA 0.27), especially for visually ambiguous identities. Analysis via confusion matrices reveals failure modes due to resolution constraints and intra-gender ambiguity.
Conclusion: There’s a critical trade-off between visual quality and semantic control in conditional DDPMs. The proposed RCA metric provides a rigorous framework for verifying identity consistency in specialized generation tasks, highlighting limitations of current models for fine-grained semantic control.
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in high-fidelity image generation. However, evaluating their semantic controllability, specifically for fine-grained, single-domain tasks, remains challenging. Standard metrics like FID and Inception Score (IS) often fail to detect identity misalignment in such specialized contexts. In this work, we investigate Class-Conditional DDPMs for K-pop idol face generation (32x32), a domain characterized by high inter-class similarity. We propose a calibrated metric, Relative Classification Accuracy (RCA), which normalizes generative performance against an oracle classifier’s baseline. Our evaluation reveals a critical trade-off: while the model achieves high visual quality (FID 8.93), it suffers from severe semantic mode collapse (RCA 0.27), particularly for visually ambiguous identities. We analyze these failure modes through confusion matrices and attribute them to resolution constraints and intra-gender ambiguity. Our framework provides a rigorous standard for verifying identity consistency in conditional generative models.
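One plausible reading of RCA, sketched below, is a ratio: the oracle classifier's accuracy on generated images (scored against the identity each image was conditioned on) divided by the same classifier's accuracy on real held-out images. The abstract does not give the exact formula, so this ratio form and the function names are assumptions for illustration only.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], target: Sequence[str]) -> float:
    """Fraction of positions where the prediction matches the target."""
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def relative_classification_accuracy(gen_predicted: Sequence[str],
                                     gen_conditioned: Sequence[str],
                                     real_predicted: Sequence[str],
                                     real_true: Sequence[str]) -> float:
    """Assumed form: oracle accuracy on generated images / oracle accuracy on real images."""
    baseline = accuracy(real_predicted, real_true)
    if baseline == 0.0:
        raise ValueError("oracle baseline accuracy is zero; RCA is undefined")
    return accuracy(gen_predicted, gen_conditioned) / baseline
```

Dividing by the oracle's own baseline keeps the metric from penalizing the generator for identities that the classifier cannot separate even on real images.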
[99] Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
Fengchen He, Dayang Zhao, Hao Xu, Tingwei Quan, Shaoqun Zeng
Main category: cs.CV
TL;DR: Proposes Sdirt method to generate realistic dual-pixel images via ray tracing to bridge domain gap between simulated and real DP data for better depth estimation generalization.
Details
Motivation: DP-depth paired datasets are scarce, especially for customized cameras. Existing DP image simulations violate real optical propagation laws, leading to poor generalization to real DP data.
Method: Sdirt (Simulating DP images from ray tracing) scheme generates realistic DP images via ray tracing and integrates them into depth estimation training pipeline.
Result: Models trained with Sdirt-simulated images generalize better to real DP data compared to previous simulation methods.
Conclusion: Ray tracing-based DP image simulation effectively bridges the domain gap between simulated and real DP data, improving depth estimation performance on real cameras.
Abstract: Many studies utilize dual-pixel (DP) sensor phase characteristics for various applications, such as depth estimation and deblurring. However, since the DP image features are entirely determined by the camera hardware, DP-depth paired datasets are very scarce, especially when performing depth estimation on customized cameras. To overcome this, studies simulate DP images using ideal optical system models. However, these simulations often violate real optical propagation laws, leading to poor generalization to real DP data. To address this, we investigate the domain gap between simulated and real DP data, and propose solutions using the Simulating DP images from ray tracing (Sdirt) scheme. The Sdirt generates realistic DP images via ray tracing and integrates them into the depth estimation training pipeline. Experimental results show that models trained with Sdirt-simulated images generalize better to real DP data. The code and collected datasets will be available at github.com/LinYark/Sdirt
[100] Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization for Cross-Subject EEG Emotion Recognition
Weiwei Wu, Yueyang Li, Yuhu Shi, Weiming Zeng, Lang Qin, Yang Yang, Ke Zhou, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang
Main category: cs.CV
TL;DR: RSM-CoDG is a novel framework for cross-subject EEG emotion recognition that combines region-aware spatial modeling, multi-scale temporal modeling, and collaborative domain generalization to address inter-subject variability and improve generalization to unseen individuals.
Details
Motivation: Cross-subject EEG emotion recognition faces challenges from strong inter-subject variability causing distribution shifts, complex spatial organization of emotion-related neural representations, and temporal evolution of neural activity. Existing methods typically address spatial modeling, temporal modeling, or generalization strategies separately, limiting their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework.
Method: Proposes RSM-CoDG (Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization) framework that: 1) Incorporates neuroscience priors from functional brain region partitioning to construct region-level spatial representations for better cross-subject comparability; 2) Employs multi-scale temporal modeling to characterize dynamic evolution of emotion-evoked neural activity; 3) Uses collaborative domain generalization strategy with multidimensional constraints to reduce subject-specific bias in fully unseen target subject settings.
Result: Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness in cross-subject EEG emotion recognition.
Conclusion: RSM-CoDG offers a comprehensive solution that integrates neuroscience-inspired spatial modeling, multi-scale temporal analysis, and advanced domain generalization techniques to address the core challenges in cross-subject EEG-based emotion recognition, achieving superior performance and better generalization to unknown individuals.
Abstract: Cross-subject EEG-based emotion recognition (EER) remains challenging due to strong inter-subject variability, which induces substantial distribution shifts in EEG signals, as well as the high complexity of emotion-related neural representations in both spatial organization and temporal evolution. Existing approaches typically improve spatial modeling, temporal modeling, or generalization strategies in isolation, which limits their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework. To address these gaps, we propose a Region-aware Spatiotemporal Modeling framework with Collaborative Domain Generalization (RSM-CoDG) for cross-subject EEG emotion recognition. RSM-CoDG incorporates neuroscience priors derived from functional brain region partitioning to construct region-level spatial representations, thereby improving cross-subject comparability. It also employs multi-scale temporal modeling to characterize the dynamic evolution of emotion-evoked neural activity. In addition, the framework employs a collaborative domain generalization strategy, incorporating multidimensional constraints to reduce subject-specific bias in a fully unseen target subject setting, which enhances the generalization to unknown individuals. Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness. The source code is available at https://github.com/RyanLi-X/RSM-CoDG.
[101] Emergence and Evolution of Interpretable Concepts in Diffusion Models
Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi
Main category: cs.CV
TL;DR: SAEs applied to diffusion models reveal interpretable concepts, enabling prediction of scene composition early in generation and controllable interventions at different stages.
Details
Motivation: Diffusion models are powerful for text-to-image generation but remain black-box systems. While SAEs have successfully interpreted LLMs, they haven't been applied to understand diffusion models' complex generative processes.
Method: Leverage Sparse Autoencoders (SAEs) to probe a popular text-to-image diffusion model’s activations, identify human-interpretable concepts, and design intervention techniques for manipulating image composition and style.
Result: Found interpretable concepts in diffusion model activations; early stages allow accurate prediction of final scene composition; interventions show composition control in early stages, style manipulation in middle stages, and only texture changes in final stages.
Conclusion: SAEs provide valuable mechanistic interpretability for diffusion models, revealing stage-specific controllability: composition control early, style control mid-stage, and only texture adjustments late.
Abstract: Diffusion models have become the go-to method for text-to-image generation, producing high-quality images from pure noise. However, the inner workings of diffusion models are still largely a mystery due to their black-box nature and complex, multi-step generation process. Mechanistic interpretability techniques, such as Sparse Autoencoders (SAEs), have been successful in understanding and steering the behavior of large language models at scale. However, the great potential of SAEs has not yet been applied toward gaining insight into the intricate generative process of diffusion models. In this work, we leverage the SAE framework to probe the inner workings of a popular text-to-image diffusion model, and uncover a variety of human-interpretable concepts in its activations. Interestingly, we find that even before the first reverse diffusion step is completed, the final composition of the scene can be predicted surprisingly well by looking at the spatial distribution of activated concepts. Moreover, going beyond correlational analysis, we design intervention techniques aimed at manipulating image composition and style, and demonstrate that (1) in the early stages of diffusion, image composition can be effectively controlled, (2) in the middle stages, image composition is finalized but stylistic interventions are effective, and (3) in the final stages, only minor textural details are subject to change.
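For readers unfamiliar with SAEs, the sketch below shows the forward pass of a generic ReLU sparse autoencoder applied to a single activation vector. It is not the architecture or training setup used in the paper, which the abstract does not specify; the function names, the pre-encoder bias subtraction, and the toy weights in the usage note are assumptions for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x: Vector, w_enc: Matrix, b_enc: Vector,
                w_dec: Matrix, b_dec: Vector) -> Tuple[Vector, Vector]:
    """Return (sparse codes, reconstruction) for one activation vector.

    codes = ReLU(W_enc (x - b_dec) + b_enc);  x_hat = W_dec codes + b_dec
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_activation = [z + b for z, b in zip(matvec(w_enc, centered), b_enc)]
    codes = [max(0.0, z) for z in pre_activation]  # ReLU keeps only firing latents
    reconstruction = [r + b for r, b in zip(matvec(w_dec, codes), b_dec)]
    return codes, reconstruction


def active_latents(codes: Vector) -> List[int]:
    """Indices of latents that fire; these are the candidate 'concepts'."""
    return [i for i, c in enumerate(codes) if c > 0.0]
```

Running this per spatial location of a feature map yields a map of which latents fire where, which is the kind of spatial distribution of activated concepts the abstract refers to. An intervention in this picture would edit `codes` before decoding, although the paper's actual intervention technique may differ.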
[102] Explainable Deepfake Detection with RL Enhanced Self-Blended Images
Ning Jiang, Dingheng Zeng, Yanhong Liu, Haiyang Yi, Shijie Yu, Minghe Weng, Haifeng Shen, Ying Li
Main category: cs.CV
TL;DR: Proposes RL-enhanced MLLM framework for interpretable deepfake detection with automated CoT data generation to address annotation scarcity.
Details
Motivation: Current deepfake detection lacks explainable outputs, and applying MLLMs faces challenges due to scarce high-quality annotated datasets. RL shows promise for improving cross-domain generalization in visual tasks.
Method: Automated Chain-of-Thought data generation framework based on Self-Blended Images, combined with RL-enhanced deepfake detection framework with tailored reward mechanism and feedback-driven synthetic data generation.
Result: Extensive experiments validate effectiveness of CoT data construction pipeline, reward mechanism, and synthetic data approach. Achieves competitive performance with SOTA across multiple cross-dataset benchmarks.
Conclusion: Proposed framework facilitates MLLM adoption in deepfake detection with reduced annotation costs and demonstrates RL’s potential for improving interpretable detection with competitive cross-domain performance.
Abstract: Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging - particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at https://github.com/deon1219/rlsbi.
[103] Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception
Bo Yuan, Danpei Zhao, Wentao Li, Tian Li, Zhiguo Jiang
Main category: cs.CV
TL;DR: The paper proposes Continual Panoptic Perception (CPP), an end-to-end model for multimodal multi-task continual learning that addresses catastrophic forgetting and semantic obfuscation through collaborative cross-modal encoding, knowledge inheritance via distillation, cross-modal consistency constraints, and asymmetric pseudo-labeling without exemplar replay.
Details
Motivation: Current continual learning research focuses too much on single-task scenarios, limiting applications in multi-task and multimodal settings. Beyond catastrophic forgetting, multi-task continual learning introduces semantic obfuscation across multimodal alignment, causing severe model degradation during incremental training.
Method: Proposes CPP model with: 1) Collaborative Cross-modal Encoder (CCE) for multimodal embedding; 2) Malleable knowledge inheritance via contrastive feature distillation and instance distillation; 3) Cross-modal consistency constraint (CPP+) for semantic alignment; 4) Asymmetric pseudo-labeling for model evolution without exemplar replay.
Result: Extensive experiments on multimodal datasets and diverse continual learning tasks demonstrate the model’s superiority, particularly in fine-grained continual learning tasks, showing effective handling of multimodal multi-task incremental scenarios.
Conclusion: The paper successfully extends continual learning to continual panoptic perception, integrating multimodal and multi-task continual learning to enhance comprehensive image perception through joint interpretation at pixel, instance, and image levels, addressing key challenges in real-world AI perception systems.
Abstract: Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. However, pioneering research has predominantly focused on single-task CL, which restricts its potential in multi-task and multimodal scenarios. Beyond the well-known issue of catastrophic forgetting, multi-task CL also introduces semantic obfuscation across multimodal alignment, leading to severe model degradation during incremental training steps. In this paper, we extend CL to continual panoptic perception (CPP), integrating multimodal and multi-task CL to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. We formalize the CL task in multimodal scenarios and propose an end-to-end continual panoptic perception model. Concretely, the CPP model features a collaborative cross-modal encoder (CCE) for multimodal embedding. We also propose a malleable knowledge inheritance module via contrastive feature distillation and instance distillation, addressing catastrophic forgetting in a task-interactive boosting manner. Furthermore, we propose a cross-modal consistency constraint and develop CPP+, ensuring multimodal semantic alignment for model updating under multi-task incremental scenarios. Additionally, our proposed model incorporates asymmetric pseudo-labeling, enabling the model to evolve without exemplar replay. Extensive experiments on multimodal datasets and diverse CL tasks demonstrate the superiority of the proposed model, particularly in fine-grained CL tasks.
[104] SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based Occupancy Prediction
Zichen Yu, Quanli Liu, Wei Wang, Liyong Zhang, Xiaoguang Zhao
Main category: cs.CV
TL;DR: SuperOcc is a novel 3D occupancy prediction framework using superquadric representations that addresses sparsity-efficiency trade-offs with temporal modeling, multi-superquadric decoding, and efficient splatting.
Details
Motivation: Existing 3D occupancy prediction methods use dense representations that overlook scene sparsity, while superquadric approaches suffer from insufficient temporal modeling, sparsity-geometry trade-offs, and inefficient splatting.
Method: SuperOcc introduces: (1) cohesive temporal modeling using both view-centric and object-centric cues, (2) multi-superquadric decoding to enhance geometry without sacrificing sparsity, and (3) efficient superquadric-to-voxel splatting scheme.
Result: Achieves state-of-the-art performance on SurroundOcc and Occ3D benchmarks while maintaining superior computational efficiency compared to existing methods.
Conclusion: SuperOcc successfully addresses key limitations of superquadric-based 3D occupancy prediction, demonstrating that sparse superquadric representations can achieve both high performance and efficiency for autonomous driving applications.
Abstract: 3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at https://github.com/Yzichen/SuperOcc.
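To make the superquadric-to-voxel step concrete, the sketch below evaluates the standard superquadric inside-outside function and marks the voxels whose centers fall inside one primitive. This is a naive hard rasterization for illustration, not SuperOcc's splatting scheme; it assumes an axis-aligned superquadric with no rotation, and the function names and grid parameters are hypothetical.

```python
from typing import Set, Tuple

Vec3 = Tuple[float, float, float]
Index3 = Tuple[int, int, int]


def inside_outside(point: Vec3, center: Vec3, scale: Vec3,
                   e1: float, e2: float) -> float:
    """Standard superquadric implicit function: F <= 1 means inside or on the surface.

    F = ((x/a)^(2/e2) + (y/b)^(2/e2))^(e2/e1) + (z/c)^(2/e1), in the primitive's own frame.
    """
    x, y, z = (abs(p - c) / s for p, c, s in zip(point, center, scale))
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)


def voxelize(center: Vec3, scale: Vec3, e1: float, e2: float,
             origin: Vec3, voxel_size: float, grid: Index3) -> Set[Index3]:
    """Return indices of voxels whose centers lie inside the superquadric."""
    occupied: Set[Index3] = set()
    for i in range(grid[0]):
        for j in range(grid[1]):
            for k in range(grid[2]):
                voxel_center = tuple(
                    o + (idx + 0.5) * voxel_size
                    for o, idx in zip(origin, (i, j, k))
                )
                if inside_outside(voxel_center, center, scale, e1, e2) <= 1.0:
                    occupied.add((i, j, k))
    return occupied
```

With `e1 = e2 = 1` the primitive is an ellipsoid, and smaller exponents push it toward a box, which is why a few superquadrics can describe a scene compactly. Visiting every voxel for every primitive, as this loop does, is the kind of cost an efficient splatting scheme would aim to avoid.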
[105] Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams
Zhenghui Guo, Yuanbin Man, Junyuan Sheng, Bowen Lin, Ahmed Ahmed, Bo Jiang, Boyuan Zhang, Miao Yin, Sian Jin, Omprakash Gnawal, Chengming Zhang
Main category: cs.CV
TL;DR: Event-VStream is an event-aware framework for real-time long video understanding that processes video as discrete semantic events rather than fixed frames, improving efficiency and temporal reasoning.
Details
Motivation: Current VLMs struggle with real-time long video streams due to redundant frame processing and forgetting past context. Existing streaming systems use inefficient fixed-interval decoding or cache pruning that loses temporal information.
Method: Represents continuous video as sequence of discrete, semantically coherent events. Detects meaningful state transitions using motion, semantic, and predictive cues, triggering language generation only at event boundaries. Consolidates event embeddings into persistent memory bank for long-horizon reasoning.
Result: Achieves +10.4 points improvement over VideoLLM-Online-8B on OVOBench-Realtime. Performs close to Flash-VStream-7B using only general-purpose LLaMA-3-8B text backbone. Maintains ~70% GPT-5 win rate on 2-hour Ego4D streams.
Conclusion: Event-VStream enables efficient real-time long video understanding by processing video as semantic events rather than redundant frames, achieving competitive performance while maintaining low latency and preserving temporal context.
Abstract: Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.
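The event-triggered decoding idea can be sketched as follows. The abstract says boundaries are detected by integrating motion, semantic, and predictive cues but does not say how they are fused; the weighted sum, the fixed threshold, the equal default weights, and the function name below are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def detect_event_boundaries(motion: Sequence[float],
                            semantic: Sequence[float],
                            predictive: Sequence[float],
                            weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
                            threshold: float = 0.5) -> List[int]:
    """Return frame indices where the fused change score exceeds the threshold.

    Each cue is assumed to be a per-frame change score in [0, 1].
    """
    if not (len(motion) == len(semantic) == len(predictive)):
        raise ValueError("cue sequences must have the same length")
    boundaries = []
    for t, cues in enumerate(zip(motion, semantic, predictive)):
        score = sum(w * c for w, c in zip(weights, cues))
        if score > threshold:
            # Only these frames would trigger language generation.
            boundaries.append(t)
    return boundaries
```

Compared with decoding at a fixed interval, the language model is invoked only at the returned indices, so a long static stretch of video produces no output and no redundant computation.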
[106] Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling
Hongyang Wei, Hongbo Liu, Zidong Wang, Yi Peng, Baixin Xu, Size Wu, Xuying Zhang, Xianglong He, Zexiang Liu, Peiyu Wang, Xuchen Song, Yangguang Li, Yang Liu, Yahui Zhou
Main category: cs.CV
TL;DR: Skywork UniPic 3.0 is a unified multimodal framework that excels at both single-image editing and multi-image composition, achieving SOTA performance with efficient 8-step inference and 12.5x speedup.
Details
Motivation: The community shows strong interest in multi-image composition (especially Human-Object Interaction tasks), but existing models lack disclosed methods for achieving high-quality fusion with consistency and quality challenges.
Method: 1) Comprehensive data pipeline with 700K high-quality samples; 2) Novel training paradigm treating multi-image composition as sequence modeling; 3) Post-training integration of trajectory mapping and distribution matching for accelerated inference.
Result: Achieves SOTA on the single-image editing benchmark and surpasses Nano-Banana and Seedream 4.0 on the multi-image composition benchmark. Supports an arbitrary number (1 to 6) of input images and arbitrary output resolutions within a total pixel budget of 1024x1024.
Conclusion: Skywork UniPic 3.0 effectively addresses multi-image composition challenges through innovative data pipeline and training paradigm, validating the approach with superior performance and efficient inference.
Abstract: The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community’s strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary number (1~6) and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on a single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on a multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models and dataset are publicly available.
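To make the "total pixel budget of 1024x1024" constraint concrete, here is a minimal sketch of how an output size could be chosen for a requested aspect ratio. The function name `fit_to_pixel_budget` and the rounding down to a multiple of 16 are assumptions for illustration; the abstract does not describe how the model picks output dimensions.

```python
import math


def fit_to_pixel_budget(aspect_w, aspect_h, budget=1024 * 1024, multiple=16):
    """Return a (width, height) with roughly the given aspect ratio and area <= budget.

    Rounding each side down to a multiple of `multiple` is a hypothetical
    convenience (assumption), not something stated in the paper.
    """
    # Scale factor that would make the area exactly equal to the budget.
    scale = math.sqrt(budget / (aspect_w * aspect_h))
    width = int(aspect_w * scale) // multiple * multiple
    height = int(aspect_h * scale) // multiple * multiple
    return width, height


square = fit_to_pixel_budget(1, 1)
wide = fit_to_pixel_budget(16, 9)
```

Because both sides are rounded down, the resulting area never exceeds the budget, at the cost of a slightly inexact aspect ratio for non-square requests.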
[107] Consistency-Regularized GAN for Few-Shot SAR Target Recognition
Yikui Zhai, Shikuang Liu, Wenlve Zhou, Hongsheng Zhang, Zhiheng Zhou, Xiaolin Tian, C. L. Philip Chen
Main category: cs.CV
TL;DR: Cr-GAN is a consistency-regularized GAN framework that synthesizes diverse SAR images for few-shot learning by decoupling adversarial training from representation learning, enabling effective data augmentation with limited samples.
Details
Motivation: Few-shot SAR recognition faces extreme data scarcity. Traditional GAN-based data augmentation requires abundant training data, creating a paradox in few-shot scenarios where data is limited.
Method: Proposes Cr-GAN with dual-branch discriminator to separate adversarial training from representation learning. Uses channel-wise feature interpolation to create novel latent features and dual-domain cycle consistency for semantic integrity. Framework is adaptable to various GAN architectures.
Result: Achieves 71.21% accuracy on MSTAR and 51.64% on SRSDD in 8-shot setting, significantly outperforming baselines. Requires only ~5% parameters of state-of-the-art diffusion models.
Conclusion: Cr-GAN resolves the GAN training paradox in few-shot SAR recognition, enabling effective data synthesis with limited samples to boost self-supervised learning performance.
Abstract: Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving highly competitive accuracies of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: https://github.com/yikuizhai/Cr-GAN.
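The channel-wise feature interpolation mentioned above can be sketched as follows. This is one plausible reading, assuming each channel receives its own independent mixing coefficient; the function name `channelwise_interpolate` and the uniform sampling of coefficients are assumptions, and the authors' repository should be consulted for the actual procedure.

```python
import random


def channelwise_interpolate(feat_a, feat_b, rng=None):
    """Mix two latent feature vectors using an independent coefficient per channel.

    feat_a, feat_b: lists of floats with one entry per channel
    rng:            optional random.Random instance for reproducibility
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same number of channels")
    rng = rng or random.Random()
    mixed = []
    for a, b in zip(feat_a, feat_b):
        lam = rng.random()  # per-channel mixing coefficient in [0, 1)
        mixed.append(lam * a + (1.0 - lam) * b)
    return mixed


novel_feature = channelwise_interpolate([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], random.Random(0))
```

Unlike a single global mixing coefficient, per-channel coefficients let the result combine different channels from each source in different proportions, which is one way to obtain latent features that neither input contains.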
[108] Performance-guided Reinforced Active Learning for Object Detection
Zhixuan Liang, Xingyu Zeng, Rui Zhao, Ping Luo
Main category: cs.CV
TL;DR: MGRAL introduces a reinforcement learning-based active learning method for object detection that uses mAP improvement as reward to directly optimize sample selection for downstream task performance.
Details
Motivation: Current active learning approaches focus on data distribution or intrinsic information content rather than directly correlating with downstream task performance metrics like mAP in object detection.
Method: Uses reinforcement learning with policy gradient optimization, where mAP improvement serves as reward. Employs expected model output changes as informativeness measure and uses fast look-up tables for efficient mAP estimation.
Result: Achieves the highest active learning curve on the PASCAL VOC and COCO benchmarks, with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.
Conclusion: MGRAL successfully addresses the combinatorial explosion challenge in batch selection and non-differentiable correlation between model performance and selected batches, offering a performance-guided approach to active learning for object detection.
Abstract: Active learning (AL) strategies aim to train high-performance models with minimal labeling efforts, only selecting the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data’s distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e. mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness. To address the combinatorial explosion challenge of batch sample selection and the non-differentiable correlation between model performance and selected batches, MGRAL skillfully employs a reinforcement learning-based sampling agent that optimizes selection using policy gradient with mAP improvement as reward. Moreover, to reduce the computational overhead of mAP estimation with unlabeled samples, MGRAL utilizes an unsupervised way with fast look-up tables, ensuring feasible deployment. We evaluate MGRAL’s active learning performance on detection tasks over PASCAL VOC and COCO benchmarks. Our approach demonstrates the highest AL curve with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.
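The policy-gradient idea in this abstract (a sampling agent rewarded by mAP improvement) can be illustrated with a generic REINFORCE step. This is a minimal sketch, assuming an independent Bernoulli "select or skip" policy per unlabeled sample; the function name `reinforce_update`, the baseline, and the learning rate are hypothetical, and the paper's actual agent architecture is not described here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_update(logits, selected, reward, baseline=0.0, lr=0.1):
    """One REINFORCE step for independent Bernoulli selection policies.

    logits:   per-sample selection logits
    selected: 0/1 actions that were sampled from the policy
    reward:   scalar reward, e.g. an estimated mAP improvement for the batch
    baseline: hypothetical variance-reduction baseline (assumption)
    """
    advantage = reward - baseline
    updated = []
    for logit, action in zip(logits, selected):
        # Gradient of log Bernoulli(action; sigmoid(logit)) w.r.t. the logit.
        grad_log_prob = action - sigmoid(logit)
        updated.append(logit + lr * advantage * grad_log_prob)
    return updated


new_logits = reinforce_update([0.0, 0.0], [1, 0], reward=1.0)
```

With a positive reward, the logit of the selected sample rises and that of the skipped sample falls, so batches that improved mAP become more likely to be chosen again. Because the reward enters only as a scalar multiplier, the mAP computation itself never needs to be differentiable.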
[109] Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs
Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, Sujuan Qin
Main category: cs.CV
TL;DR: BVS is a novel jailbreaking framework that probes visual safety boundaries of MLLMs using image-text pairs, achieving 98.21% success rate against GPT-5.
Details
Motivation: Existing security schemes for Multimodal Large Language Models (MLLMs) insufficiently investigate visual safety boundaries, leaving critical vulnerabilities unexplored at the intersection of textual and visual safety.
Method: BVS employs a “reconstruction-then-generation” strategy using neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, inducing MLLMs to generate harmful images.
Result: BVS achieves a remarkable 98.21% jailbreak success rate against GPT-5 (12 January 2026 release), exposing critical vulnerabilities in current MLLMs’ visual safety alignment.
Conclusion: The framework successfully probes and exposes significant visual safety vulnerabilities in state-of-the-art MLLMs, demonstrating the urgent need for improved visual safety alignment mechanisms.
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a “reconstruction-then-generation” strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby inducing MLLMs to generate harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
[110] Enhanced LULC Segmentation via Lightweight Model Refinements on ALOS-2 SAR Data
Ali Caglayan, Nevrez Imamoglu, Toru Kouyama
Main category: cs.CV
TL;DR: This paper improves national-scale land-use/land-cover semantic segmentation using ALOS-2 SAR data over Japan with three lightweight refinements to address common SAR dense-prediction problems without increasing pipeline complexity.
Details
Motivation: The paper aims to address common failure modes in SAR dense-prediction tasks: boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed label distributions, while working with single-polarization (HH) SAR data for national-scale LULC mapping over Japan.
Method: Three lightweight refinements: (1) injecting high-resolution features into multi-scale decoding, (2) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (3) an α-scale factor that tempers class reweighting within a focal+dice objective. Builds on SAR-W-MixMAE self-supervised pretraining.
Result: The model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
Conclusion: The proposed lightweight refinements effectively address SAR dense-prediction challenges without increasing pipeline complexity, demonstrating improved performance for national-scale LULC semantic segmentation and binary water detection tasks using ALOS-2 SAR data.
Abstract: This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes (boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels) without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an α-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
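One way to read "an α-scale factor that tempers class reweighting" is as an exponent on inverse-frequency class weights. The sketch below shows that interpretation only; the exact formula is not given in the abstract, so the exponent form, the mean-1 normalization, and the function name `tempered_class_weights` are all assumptions.

```python
def tempered_class_weights(class_counts, alpha=0.5):
    """Inverse-frequency class weights raised to the power alpha, normalized to mean 1.

    class_counts: positive pixel (or sample) counts per class
    alpha:        0 gives uniform weights, 1 gives full inverse-frequency weights
                  (hypothetical parameterization, not confirmed by the paper)
    """
    total = sum(class_counts)
    raw = [(total / count) ** alpha for count in class_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# A long-tailed two-class example: 900 pixels of a common class, 100 of a rare one.
uniform = tempered_class_weights([900, 100], alpha=0.0)
tempered = tempered_class_weights([900, 100], alpha=0.5)
full = tempered_class_weights([900, 100], alpha=1.0)
```

In this example the rare class is weighted 9 times more than the common class at α = 1 but only 3 times more at α = 0.5, which shows how an intermediate α can help rare classes without letting them dominate the loss.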
[111] Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework
Shubham Shukla, Kunal Sonalkar
Main category: cs.CV
TL;DR: VLMs outperform traditional methods for fashion attribute prediction but struggle with detecting when attributes are applicable vs. not applicable, with efficient models offering 90% of flagship performance at lower cost.
Details
Motivation: Fashion retail applications need fine-grained attribute prediction for catalog enrichment, search, and recommendations. VLMs offer zero-shot capabilities but their performance on multi-attribute fashion tasks with conditional attributes (where some attributes may not be applicable) hasn't been systematically evaluated.
Method: Introduced a three-tier evaluation framework: (1) overall task performance across all classes including NA (not applicable), (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Benchmarked 9 VLMs across flagship, efficient, and ultra-efficient tiers against classifiers trained on Fashion-CLIP embeddings using DeepFashion-MultiModal dataset with 5,000 images across 18 attributes.
Result: Zero-shot VLMs achieved 64.0% macro-F1 (3x improvement over logistic regression on Fashion-CLIP). VLMs excel at fine-grained classification (70.8% F1) but struggle with applicability detection (34.1% NA-F1). Efficient models achieve over 90% of flagship performance at lower cost.
Conclusion: The diagnostic framework helps identify whether errors come from visibility detection or classification, guiding targeted improvements. Efficient VLMs offer practical deployment paths for fashion applications, though applicability detection remains a key bottleneck.
Abstract: Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, “outer fabric” is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn’t exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
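The three tiers can be sketched for a single attribute as follows. This is a minimal illustration, assuming Tier 2 is the F1 of the NA class and Tier 3 is macro-F1 restricted to samples whose ground truth is not NA; the function names and these exact definitions are assumptions, and the paper may aggregate differently across its 18 attributes.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of one class, returning 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


def three_tier_scores(y_true, y_pred, na_label="NA"):
    tier1 = macro_f1(y_true, y_pred)                # all classes, including NA
    tier2 = f1_for_class(y_true, y_pred, na_label)  # applicability detection (NA-F1)
    applicable = [(t, p) for t, p in zip(y_true, y_pred) if t != na_label]
    # Fine-grained classification on samples where the attribute is determinable.
    tier3 = (
        macro_f1([t for t, _ in applicable], [p for _, p in applicable])
        if applicable
        else 0.0
    )
    return tier1, tier2, tier3


truth = ["cotton", "NA", "denim", "NA"]
preds = ["cotton", "cotton", "denim", "NA"]
tier1, tier2, tier3 = three_tier_scores(truth, preds)
```

In this toy example the model classifies every applicable sample correctly (Tier 3 is perfect) yet mislabels one NA sample as "cotton", so Tiers 1 and 2 are lower. That is the same pattern the paper reports at scale: strong fine-grained classification alongside weak applicability detection.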
[112] VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
Chenglin Li, Qianglong Chen, Feng Han, Yikun Wang, Xingxi Yin, Yan Gong, Ruilin Li, Yin Zhang, Jiaqi Wang
Main category: cs.CV
TL;DR: VideoThinker: An agentic Video LLM trained on synthetic tool interaction trajectories for long-form video understanding, outperforming caption-only agents and video baselines.
Details
Motivation: Existing Video LLMs rely on static reasoning over uniformly sampled frames, leading to weak temporal localization and information loss in long videos. Agentic tools (temporal retrieval, spatial/temporal zoom) could help but require training data from models that already have strong long-form video understanding, creating a circular dependency.
Method: 1) Convert videos into rich captions, 2) Use a powerful agentic language model to generate multi-step tool use sequences in caption space, 3) Ground trajectories back to video by replacing captions with corresponding frames, creating synthetic interleaved video-tool reasoning dataset, 4) Train VideoThinker on this synthetic agentic dataset.
Result: VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating effectiveness of tool-augmented synthetic data and adaptive retrieval/zoom reasoning.
Conclusion: The approach successfully breaks the circular dependency in agentic video understanding data creation, enabling training of models with dynamic reasoning, adaptive temporal exploration, and multi-step tool use capabilities for long-form video understanding.
Abstract: Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.
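The grounding step (replacing captions with the corresponding frames) can be sketched as a simple data transformation. The trajectory schema below, including the keys `type`, `caption_id`, and `frames` and the tool name `temporal_retrieval`, is hypothetical; the abstract does not specify the data format.

```python
def ground_trajectory(trajectory, caption_to_frames):
    """Replace caption observations in a tool-use trajectory with frame references.

    trajectory:        list of step dicts produced in caption space (hypothetical schema)
    caption_to_frames: maps each caption id to the frames that caption describes
    """
    grounded = []
    for step in trajectory:
        new_step = dict(step)  # copy so the caption-space trajectory stays intact
        if new_step.get("type") == "observation":
            caption_id = new_step.pop("caption_id")
            new_step["frames"] = caption_to_frames[caption_id]
        grounded.append(new_step)
    return grounded


caption_space = [
    {"type": "tool_call", "tool": "temporal_retrieval", "query": "person opens the door"},
    {"type": "observation", "caption_id": "c3"},
    {"type": "answer", "text": "The door is opened near the start of the clip."},
]
frames_by_caption = {"c3": ["frame_0090.jpg", "frame_0091.jpg"]}
video_space = ground_trajectory(caption_space, frames_by_caption)
```

The tool calls and final answer are left untouched; only the observations change modality. This is why the trajectory generator itself never needs to process video.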
[113] FAIR-ESI: Feature Adaptive Importance Refinement for Electrophysiological Source Imaging
Linyong Zou, Liang Zhang, Xiongfei Wang, Jia-Hong Gao, Yi Sun, Shurong Sheng, Kuntao Xiao, Wanli Yang, Pengfei Teng, Guoming Luan, Zhao Lv, Zikang Xu
Main category: cs.CV
TL;DR: FAIR-ESI is a novel electrophysiological source imaging framework that adaptively refines feature importance across multiple views (spectral, temporal, patch-wise) to improve brain disorder diagnosis.
Details
Motivation: Accurate feature selection and refinement is a central challenge for precise electrophysiological source imaging (ESI) in brain disorder diagnosis, despite promising results from existing model-based optimization and deep learning methods.
Method: FAIR-ESI adaptively refines feature importance across three different views: 1) FFT-based spectral feature refinement, 2) weighted temporal feature refinement, and 3) self-attention-based patch-wise feature refinement.
Result: Extensive experiments on two simulation datasets with diverse configurations and two real-world clinical datasets validate the framework’s efficacy.
Conclusion: FAIR-ESI has potential to advance brain disorder diagnosis and offer new insights into brain function through improved electrophysiological source imaging.
Abstract: An essential technique for diagnosing brain disorders is electrophysiological source imaging (ESI). While model-based optimization and deep learning methods have achieved promising results in this field, the accurate selection and refinement of features remains a central challenge for precise ESI. This paper proposes FAIR-ESI, a novel framework that adaptively refines feature importance across different views, including FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement. Extensive experiments on two simulation datasets with diverse configurations and two real-world clinical datasets validate our framework’s efficacy, highlighting its potential to advance brain disorder diagnosis and offer new insights into brain function.
[114] Sub-Region-Aware Modality Fusion and Adaptive Prompting for Multi-Modal Brain Tumor Segmentation
Shadi Alijani, Fereshteh Aghaee Meibodi, Homayoun Najjaran
Main category: cs.CV
TL;DR: A novel framework for adapting foundation models to multi-modal medical imaging using sub-region-aware modality attention and adaptive prompt engineering, achieving superior brain tumor segmentation performance.
Details
Motivation: Existing foundation models struggle with effective multi-modal fusion and adaptation to heterogeneous pathological tissues in medical imaging, creating a critical unresolved challenge.
Method: Two key innovations: 1) sub-region-aware modality attention that learns optimal modality combinations for each tumor sub-region, and 2) adaptive prompt engineering that leverages foundation model capabilities to refine segmentation accuracy.
Result: Validated on BraTS 2020 brain tumor segmentation dataset, significantly outperforms baseline methods, especially in the challenging necrotic core sub-region.
Conclusion: Provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.
Abstract: The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.
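Sub-region-aware modality attention can be pictured as a separate softmax over modalities for each tumor sub-region, followed by a weighted sum of modality features. The sketch below is a minimal illustration under that assumption; the function name `fuse_modalities`, the fixed logits, and the region names are hypothetical, and in the real model the attention weights would be learned.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(region_logits, modality_features):
    """Fuse modality features with a separate attention distribution per sub-region.

    region_logits:     {sub_region_name: [one logit per modality]}
    modality_features: [one feature vector per modality], all of equal length
    """
    dim = len(modality_features[0])
    fused = {}
    for region, logits in region_logits.items():
        weights = softmax(logits)
        fused[region] = [
            sum(w * feat[i] for w, feat in zip(weights, modality_features))
            for i in range(dim)
        ]
    return fused


# Two modalities with 2-dimensional features (toy values).
features = [[1.0, 0.0], [0.0, 1.0]]
logits = {"necrotic_core": [2.0, 0.0], "edema": [0.0, 0.0]}
fused = fuse_modalities(logits, features)
```

Here the "necrotic_core" region leans on the first modality while "edema" averages both equally, which is the kind of per-region specialization the abstract describes.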
[115] Breaking the Resolution Barrier: Arbitrary-resolution Deep Image Steganography Framework
Xinjue Hu, Chi Wang, Boyu Wang, Xiang Zhang, Zhenshan Tan, Zhangjie Fu
Main category: cs.CV
TL;DR: ARDIS enables arbitrary resolution image steganography by shifting from discrete mapping to reference-guided continuous signal reconstruction, allowing secret images to be hidden and recovered at their original resolutions without prior resampling.
Details
Motivation: Current deep image steganography methods require secret and cover images to have the same resolution, forcing resampling that causes detail loss during recovery and preventing recovery to original resolution when resolution is unknown.
Method: 1) Frequency Decoupling Architecture: Disentangles secret into resolution-aligned global basis and resolution-agnostic high-frequency latent for hiding. 2) Latent-Guided Implicit Reconstructor: Uses recovered detail latent to modulate implicit function for high-frequency residual reconstruction. 3) Implicit Resolution Coding: Encodes discrete resolution values as dense feature maps for blind recovery.
Result: ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity, enabling faithful restoration of original details without resolution constraints.
Conclusion: ARDIS successfully addresses resolution limitations in deep image steganography by introducing a continuous signal reconstruction paradigm that enables arbitrary resolution hiding and blind recovery while preserving image details.
Abstract: Deep image steganography (DIS) has achieved significant results in capacity and invisibility. However, current paradigms enforce the secret image to maintain the same resolution as the cover image during hiding and revealing. This leads to two challenges: secret images with inconsistent resolutions must undergo resampling beforehand which results in detail loss during recovery, and the secret image cannot be recovered to its original resolution when the resolution value is unknown. To address these, we propose ARDIS, the first Arbitrary Resolution DIS framework, which shifts the paradigm from discrete mapping to reference-guided continuous signal reconstruction. Specifically, to minimize the detail loss caused by resolution mismatch, we first design a Frequency Decoupling Architecture in hiding stage. It disentangles the secret into a resolution-aligned global basis and a resolution-agnostic high-frequency latent to hide in a fixed-resolution cover. Second, for recovery, we propose a Latent-Guided Implicit Reconstructor to perform deterministic restoration. The recovered detail latent code modulates a continuous implicit function to accurately query and render high-frequency residuals onto the recovered global basis, ensuring faithful restoration of original details. Furthermore, to achieve blind recovery, we introduce an Implicit Resolution Coding strategy. By transforming discrete resolution values into dense feature maps and hiding them in the redundant space of the feature domain, the reconstructor can correctly decode the secret’s resolution directly from the steganographic representation. Experimental results demonstrate that ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity.
[116] White-Box mHC: Electromagnetic Spectrum-Aware and Interpretable Stream Interactions for Hyperspectral Image Classification
Yimin Zhu, Lincoln Linlin Xu, Zhengsen Xu, Zack Dewis, Mabel Heffring, Saeid Taleghanidoozdoozan, Motasem Alkayid, Quinn Ledingham, Megan Greenwood
Main category: cs.CV
TL;DR: ES-mHC introduces a physically-inspired white-box framework for hyperspectral image classification that explicitly models electromagnetic spectrum interactions using structured matrices, improving interpretability while maintaining performance.
Details
Motivation: Most deep learning models for hyperspectral image classification rely on opaque spectral-spatial feature mixing, which limits interpretability and prevents understanding of internal decision mechanisms. There's a need for more transparent models that reveal how electromagnetic spectrum groupings interact during classification.
Method: ES-mHC uses a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (residual streams) using structured, directional matrices. It separates feature representation from interaction structure, promoting spectrum grouping specialization and reducing redundancy while exposing internal information flow that can be visualized and spatially analyzed.
Result: The learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model's internal dynamics. Increasing the expansion rate accelerates the emergence of structured interaction patterns. The framework transforms HSIC from black-box prediction to structurally transparent learning.
Conclusion: ES-mHC successfully creates a physically-inspired white-box approach for hyperspectral image classification that maintains performance while providing interpretability through explicit modeling of electromagnetic spectrum interactions, enabling visualization and analysis of internal decision mechanisms.
Abstract: In hyperspectral image classification (HSIC), most deep learning models rely on opaque spectral-spatial feature mixing, limiting their interpretability and hindering understanding of internal decision mechanisms. We present physical spectrum-aware white-box mHC, named ES-mHC, a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (the residual streams in mHC) using structured, directional matrices. By separating feature representation from interaction structure, ES-mHC promotes electromagnetic spectrum grouping specialization, reduces redundancy, and exposes internal information flow that can be directly visualized and spatially analyzed. Using hyperspectral image classification as a representative testbed, we demonstrate that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model's internal dynamics. Furthermore, we find that increasing the expansion rate accelerates the emergence of structured interaction patterns. These results suggest that ES-mHC transforms HSIC from a purely black-box prediction task into a structurally transparent, partially white-box learning process.
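The notion of directional interaction matrices between residual streams can be illustrated with a small sketch: each updated stream is a weighted combination of all streams, and the asymmetry of the weight matrix can be inspected directly. The function names `mix_streams` and `asymmetry` and the toy matrix are assumptions; the structured constraints that ES-mHC places on its matrices are not described in the abstract and are not modeled here.

```python
def mix_streams(streams, H):
    """Mix n residual streams with an n x n directional matrix H.

    streams: list of n feature vectors, one per spectrum grouping
    H[i][j]: how much stream j contributes to the updated stream i
    """
    n, dim = len(streams), len(streams[0])
    return [
        [sum(H[i][j] * streams[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


def asymmetry(H):
    """Mean absolute difference between H[i][j] and H[j][i] over off-diagonal pairs."""
    n = len(H)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(abs(H[i][j] - H[j][i]) for i, j in pairs) / len(pairs)


# Two hypothetical spectrum groupings with 2-dimensional features.
streams = [[1.0, 2.0], [3.0, 4.0]]
H = [[1.0, 0.2], [0.6, 1.0]]  # the second stream draws more from the first than vice versa
mixed = mix_streams(streams, H)
score = asymmetry(H)
```

Because the interaction structure lives entirely in `H` rather than inside the feature vectors, it can be read off and visualized without probing the features, which is the sense in which the abstract calls the approach "white-box".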
[117] Atlas-Assisted Segment Anything Model for Fetal Brain MRI (FeTal-SAM)
Qi Zeng, Weide Liu, Bo Li, Ryne Didier, P. Ellen Grant, Davood Karimi
Main category: cs.CV
TL;DR: FeTal-SAM adapts Segment Anything Model for fetal brain MRI segmentation using atlas-based prompts, enabling flexible segmentation without retraining for different label definitions.
Details
Motivation: Traditional deep learning methods require large annotated datasets for fixed labels and lack flexibility when clinical/research needs change. There's also limited insight into whether segmentations are driven by genuine image contrast or learned spatial priors.
Method: Integrates atlas-based prompts with foundation-model principles. Uses multi-atlas registration to generate spatially aligned label templates as dense prompts, plus bounding-box prompts for SAM’s decoder. Performs binary segmentation per structure and fuses results to reconstruct full 3D volumes.
Result: Achieves Dice scores comparable to state-of-the-art baselines for well-contrasted structures (cortical plate, cerebellum) across gestational ages on dHCP and in-house datasets. Slightly lower accuracy for subtle, low-contrast structures (hippocampus, amygdala). Maintains flexibility to segment any user-specified anatomy.
Conclusion: FeTal-SAM shows potential as a general-purpose segmentation model without exhaustive retraining, representing a promising step toward clinically adaptable fetal brain MRI analysis tools.
Abstract: This paper presents FeTal-SAM, a novel adaptation of the Segment Anything Model (SAM) tailored for fetal brain MRI segmentation. Traditional deep learning methods often require large annotated datasets for a fixed set of labels, making them inflexible when clinical or research needs change. By integrating atlas-based prompts and foundation-model principles, FeTal-SAM addresses two key limitations in fetal brain MRI segmentation: (1) the need to retrain models for varying label definitions, and (2) the lack of insight into whether segmentations are driven by genuine image contrast or by learned spatial priors. We leverage multi-atlas registration to generate spatially aligned label templates that serve as dense prompts, alongside a bounding-box prompt, for SAM’s segmentation decoder. This strategy enables binary segmentation on a per-structure basis, which is subsequently fused to reconstruct the full 3D segmentation volumes. Evaluations on two datasets, the dHCP dataset and an in-house dataset, demonstrate FeTal-SAM’s robust performance across gestational ages. Notably, for well-contrasted structures like cortical plate and cerebellum, it achieves Dice scores comparable to state-of-the-art baselines that were trained for each dataset and label definition, while maintaining the flexibility to segment any user-specified anatomy. Although slightly lower accuracy is observed for subtle, low-contrast structures (e.g., hippocampus, amygdala), our results highlight FeTal-SAM’s potential to serve as a general-purpose segmentation model without exhaustive retraining. This method thus constitutes a promising step toward clinically adaptable fetal brain MRI analysis tools.
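The "binary segmentation per structure, then fuse" step and the Dice metric can be sketched as follows. The fusion rule used here (highest-probability structure wins, with a background threshold) is an assumption, since the abstract does not state how overlapping binary masks are resolved; `fuse_binary_masks` and `dice` are hypothetical names, and voxels are flattened into lists for brevity.

```python
def fuse_binary_masks(prob_maps, threshold=0.5, background="background"):
    """Fuse per-structure probability maps into one label per voxel.

    prob_maps: {structure_name: [probability per voxel]} with equal-length lists
    threshold: hypothetical cut-off below which a voxel is labeled background
    """
    names = list(prob_maps)
    n_voxels = len(prob_maps[names[0]])
    labels = []
    for v in range(n_voxels):
        # Pick the structure whose binary segmentation is most confident here.
        best = max(names, key=lambda name: prob_maps[name][v])
        labels.append(best if prob_maps[best][v] >= threshold else background)
    return labels


def dice(pred, truth, label):
    """Dice coefficient for one label: 2 * |P ∩ T| / (|P| + |T|)."""
    p = [x == label for x in pred]
    t = [x == label for x in truth]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    return 1.0 if denom == 0 else 2.0 * intersection / denom


prob_maps = {
    "cortical_plate": [0.9, 0.2, 0.1],
    "cerebellum": [0.3, 0.8, 0.4],
}
fused = fuse_binary_masks(prob_maps)
reference = ["cortical_plate", "cerebellum", "cerebellum"]
cerebellum_dice = dice(fused, reference, "cerebellum")
```

Since each structure is segmented independently, adding a new user-specified structure only means adding another entry to `prob_maps`. The fusion step itself does not depend on a fixed label set.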
[118] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps
Yuhan Chen, Ying Fang, Guofa Li, Wenxuan Yu, Yicui Shi, Jingrui Zhang, Kefei Qian, Wenbo Chu, Keqiang Li
Main category: cs.CV
TL;DR: LL-GaussianMap introduces the first unsupervised framework using 2D Gaussian Splatting for low-light image enhancement, formulating enhancement as gain map generation guided by explicit Gaussian primitives.
Details
Motivation: Existing low-light enhancement methods operate in pixel domain or use implicit features, neglecting intrinsic geometric structural priors. 2D Gaussian Splatting (2DGS) has superior structural fitting but hasn't been explored for low-level vision tasks.
Method: Two-stage approach: 1) High-fidelity structural reconstruction using 2DGS primitives, 2) Data-driven enhancement dictionary coefficients rendered via Gaussian splatting rasterization through unified enhancement module. Formulates enhancement as gain map generation guided by 2DGS.
Result: Achieves superior enhancement performance with extremely low storage footprint, effectively preserves edges and suppresses artifacts, demonstrating effectiveness of explicit Gaussian representations for image enhancement.
Conclusion: LL-GaussianMap successfully bridges the gap by incorporating 2DGS into low-light enhancement, showing that explicit Gaussian representations can effectively preserve structural priors while achieving high-quality enhancement with minimal storage requirements.
Abstract: Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning. Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.
[119] LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting
Yuhan Chen, Wenxuan Yu, Guofa Li, Yijun Xu, Ying Fang, Yicui Shi, Long Cao, Wenbo Chu, Keqiang Li
Main category: cs.CV
TL;DR: LL-GaussianImage is a zero-shot unsupervised framework for low-light enhancement directly in the 2DGS compressed domain, avoiding the decompression-enhancement-recompression pipeline.
Details
Motivation: Existing low-light enhancement operates in the pixel domain, so 2DGS-compressed images must first be decompressed, which is inefficient and causes secondary degradation. Direct enhancement in the compressed representation domain is needed.
Method: 1) Semantic-guided Mixture-of-Experts framework using rendered images as guidance for dynamic adaptive transformations in sparse 2DGS attribute space. 2) Multi-objective collaborative loss system for smoothness and fidelity constraints. 3) Two-stage optimization: single-scale reconstruction for base representation accuracy and network robustness enhancement.
Result: Achieves high-quality low-light enhancement while maintaining high compression ratios. Validates feasibility and superiority of direct processing in compressed representation domain.
Conclusion: LL-GaussianImage enables compression-as-enhancement without full decompression, establishing a new paradigm for efficient low-light enhancement in 2DGS compressed domain.
Abstract: 2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image compression due to high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly within the pixel domain. Processing 2DGS-compressed images necessitates a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework designed for low-light enhancement directly within the 2DGS compressed representation domain. Three primary advantages are offered by this framework. First, a semantic-guided Mixture-of-Experts enhancement framework is designed. Dynamic adaptive transformations are applied to the sparse attribute space of 2DGS using rendered images as guidance to enable compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system is established to strictly constrain smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process is utilized to achieve reconstruction-as-enhancement. The accuracy of the base representation is ensured through single-scale reconstruction and network robustness is enhanced. High-quality enhancement of low-light images is achieved while high compression ratios are maintained. The feasibility and superiority of the paradigm for direct processing within the compressed representation domain are validated through experimental results.
[120] Diffusion Model-Based Data Augmentation for Enhanced Neuron Segmentation
Liuyun Jiang, Yanchao Zhang, Jinyue Guo, Yizhuo Lu, Ruining Zhou, Hua Han
Main category: cs.CV
TL;DR: A diffusion-based data augmentation framework for neuron segmentation in EM images that generates diverse, structurally plausible image-label pairs to address limited training data.
Details
Motivation: Current deep learning methods for neuron segmentation in EM require large-scale annotated data, which is time-consuming to obtain. Traditional augmentation methods produce highly correlated samples lacking structural diversity.
Method: Proposes a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors for voxel-level image synthesis from 3D masks, plus a biology-guided mask remodeling module for enhanced structural realism.
Result: Improves ARAND metric by 32.1% on AC3 dataset and 30.7% on AC4 dataset under low-annotation regimes when combined with two different post-processing methods.
Conclusion: The diffusion-based data augmentation framework effectively enriches training sets and improves neuron segmentation performance, especially in low-annotation scenarios.
Abstract: Neuron segmentation in electron microscopy (EM) aims to reconstruct the complete neuronal connectome; however, current deep learning-based methods are limited by their reliance on large-scale training data and extensive, time-consuming manual annotations. Traditional methods augment the training set through geometric and photometric transformations; however, the generated samples remain highly correlated with the original images and lack structural diversity. To address this limitation, we propose a diffusion-based data augmentation framework capable of generating diverse and structurally plausible image-label pairs for neuron segmentation. Specifically, the framework employs a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors to enable voxel-level image synthesis from 3D masks. It further incorporates a biology-guided mask remodeling module that produces augmented masks with enhanced structural realism. Together, these components effectively enrich the training set and improve segmentation performance. On the AC3 and AC4 datasets under low-annotation regimes, our method improves the ARAND metric by 32.1% and 30.7%, respectively, when combined with two different post-processing methods. Our code is available at https://github.com/HeadLiuYun/NeuroDiff.
[121] Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
Pascal Benschop, Justin Dauwels, Jan van Gemert
Main category: cs.CV
TL;DR: VLMs struggle with spatial reasoning tasks involving subtle temporal/geometric cues, performing near chance on a new synthetic benchmark testing situational and spatial awareness through minimal video pairs.
Details
Motivation: Spatial reasoning in VLMs remains fragile when semantics depend on subtle temporal or geometric cues, highlighting the need for better benchmarks to diagnose these weaknesses.
Method: Created a synthetic benchmark with minimal video pairs to test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. Evaluated recent VLMs in training-free setting.
Result: VLMs perform only slightly above chance across all tasks. Simple color cues partly reduce assailant role confusions but don’t resolve underlying spatial reasoning weaknesses.
Conclusion: The benchmark reveals fundamental spatial reasoning limitations in VLMs and provides reproducible diagnostics to explore lightweight spatial priors as complements to large-scale pretraining.
Abstract: Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
[122] A Mobile Application for Flower Recognition System Based on Convolutional Neural Networks
Mustafa Yurdakul, Enes Ayan, Fahrettin Horasan, Sakir Tasdemir
Main category: cs.CV
TL;DR: A mobile app using CNN models (MobileNet, DenseNet121, Xception) was developed for flower classification, with DenseNet121+SGD achieving 95.84% accuracy.
Details
Motivation: Flowers have many uses but identifying them requires expert knowledge, which isn't always accessible. A mobile app could provide non-specialists with quick flower identification.
Method: Developed a mobile application using three CNN models (MobileNet, DenseNet121, Xception) and evaluated them with seven different optimization algorithms to find the best model for mobile deployment.
Result: DenseNet-121 with stochastic gradient descent (SGD) optimization achieved the best performance: 95.84% accuracy and 96.00% precision, recall, and F1-score.
Conclusion: CNNs can be effectively used for flower classification in mobile applications, with DenseNet121+SGD being the most suitable model for this purpose.
Abstract: A convolutional neural network (CNN) is a deep learning algorithm that has been specifically designed for computer vision applications. CNNs have proved successful in handling the increasing amount of data in many computer vision problems where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decorating to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge. However, accessing experts at any time and in any location may not always be feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers and to provide non-specialists with quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet121, and Xception, to determine the most suitable model for the mobile application. The classification performances of the models were evaluated by training them with seven different optimization algorithms. The DenseNet-121 architecture trained with the stochastic gradient descent (SGD) optimization algorithm was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.
[123] Beyond Off-the-Shelf Models: A Lightweight and Accessible Machine Learning Pipeline for Ecologists Working with Image Data
Clare Chemery, Hendrik Edelhoff, Ludwig Bothmann
Main category: cs.CV
TL;DR: A lightweight ML pipeline for ecologists to build custom image classifiers without advanced expertise, demonstrated on red deer age/sex classification with 90-96% accuracy.
Details
Motivation: To lower the barrier for ecologists to apply machine learning to image classification tasks, enabling them to move beyond off-the-shelf models and create tailored solutions for specific ecological research questions and local datasets.
Method: A lightweight experimentation pipeline combining command-line interface for preprocessing/training/evaluation with graphical interface for annotation/error analysis/model comparison. Demonstrated on red deer classification using 4352 expert-labeled cropped images from camera traps, training multiple backbone architectures with various parameters and data augmentation strategies.
Result: Best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification of red deer from camera trap images, demonstrating reliable demographic classification with limited data.
Conclusion: The framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, enabling broader adoption of ML in wildlife monitoring and demographic analysis, even for narrow, well-defined ecological problems with limited data.
Abstract: We introduce a lightweight experimentation pipeline designed to lower the barrier for applying machine learning (ML) methods for classifying images in ecological research. We enable ecologists to experiment with ML models independently, so that they can move beyond off-the-shelf models and generate insights tailored to local datasets, specific classification tasks, and target variables. Our tool combines a simple command-line interface for preprocessing, training, and evaluation with a graphical interface for annotation, error analysis, and model comparison. This design enables ecologists to build and iterate on compact, task-specific classifiers without requiring advanced ML expertise. As a proof of concept, we apply the pipeline to classify red deer (Cervus elaphus) by age and sex from 3392 camera trap images collected in the Veldenstein Forest, Germany. Using 4352 cropped images containing individual deer labeled by experts, we trained and evaluated multiple backbone architectures with a wide variety of parameters and data augmentation strategies. Our best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification. These results demonstrate that reliable demographic classification is feasible even with limited data when the ecological problem is narrow and well defined. More broadly, the framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, paving the way for broader adoption of ML in wildlife monitoring and demographic analysis.
[124] Towards Realistic Remote Sensing Dataset Distillation with Discriminative Prototype-guided Diffusion
Yonghao Xu, Pedram Ghamisi, Qihao Weng
Main category: cs.CV
TL;DR: First application of dataset distillation to remote sensing imagery using diffusion models with classifier guidance and latent clustering to create compact, representative training datasets.
Details
Motivation: Address two major challenges in deep learning for remote sensing: (1) high storage and computational costs of large datasets, and (2) data leakage risks with sensitive categories. Current reliance on massive training data creates these problems.
Method: Train text-to-image diffusion model to condense large remote sensing datasets. Use classifier-driven guidance with classification consistency loss from pre-trained model. Perform latent space clustering to select diverse prototypes as visual style guidance, and use visual language model for aggregated text descriptions.
Result: Experiments on three high-resolution remote sensing scene classification benchmarks show the method can distill realistic and diverse samples for downstream model training.
Conclusion: Successfully introduces dataset distillation to remote sensing, creating compact representative datasets that address storage/computational costs and data privacy concerns while maintaining training effectiveness.
Abstract: Recent years have witnessed the remarkable success of deep learning in remote sensing image interpretation, driven by the availability of large-scale benchmark datasets. However, this reliance on massive training data also brings two major challenges: (1) high storage and computational costs, and (2) the risk of data leakage, especially when sensitive categories are involved. To address these challenges, this study introduces the concept of dataset distillation into the field of remote sensing image interpretation for the first time. Specifically, we train a text-to-image diffusion model to condense a large-scale remote sensing dataset into a compact and representative distilled dataset. To improve the discriminative quality of the synthesized samples, we propose a classifier-driven guidance by injecting a classification consistency loss from a pre-trained model into the diffusion training process. Besides, considering the rich semantic complexity of remote sensing imagery, we further perform latent space clustering on training samples to select representative and diverse prototypes as visual style guidance, while using a visual language model to provide aggregated text descriptions. Experiments on three high-resolution remote sensing scene classification benchmarks show that the proposed method can distill realistic and diverse samples for downstream model training. Code and pre-trained models are available online (https://github.com/YonghaoXu/DPD).
[125] An IoT-Based Smart Plant Monitoring and Irrigation System with Real-Time Environmental Sensing, Automated Alerts, and Cloud Analytics
Abdul Hasib, A. S. M. Ahsanul Sarkar Akib
Main category: cs.CV
TL;DR: IoT-based smart plant monitoring system using ESP32 with environmental sensors, automated irrigation, and cloud analytics reduces water usage by 40% with 92% soil moisture accuracy at $45.20 cost.
Details
Motivation: Traditional farming methods cause water wastage, inconsistent plant growth, and delayed response to environmental changes, while global demand for sustainable agriculture requires intelligent monitoring systems for optimized resource utilization.
Method: ESP32 microcontroller collects real-time data from DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, with OLED display and buzzer alerts. Data is transmitted to ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alerts.
Result: System maintains optimal soil moisture levels with 92% accuracy, reduces water consumption by approximately 40% compared to conventional methods, and provides comprehensive real-time environmental monitoring through integrated web dashboard.
Conclusion: The $45.20 IoT system offers an affordable, scalable solution for precision agriculture suitable for both small-scale gardening and commercial farming applications, addressing sustainable agriculture needs through intelligent monitoring and resource optimization.
Abstract: The increasing global demand for sustainable agriculture necessitates intelligent monitoring systems that optimize resource utilization and plant health management. Traditional farming methods rely on manual observation and periodic watering, often leading to water wastage, inconsistent plant growth, and delayed response to environmental changes. This paper presents a comprehensive IoT-based smart plant monitoring system that integrates multiple environmental sensors with automated irrigation and cloud analytics. The proposed system utilizes an ESP32 microcontroller to collect real-time data from DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, with visual feedback through an OLED display and auditory alerts via a buzzer. All sensor data is wirelessly transmitted to the ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alert generation. Experimental results demonstrate the system’s effectiveness in maintaining optimal soil moisture levels (with 92% accuracy), providing real-time environmental monitoring, and reducing water consumption by approximately 40% compared to conventional irrigation methods. The integrated web dashboard offers comprehensive visualization of plant health parameters, making it suitable for both small-scale gardening and commercial agriculture applications. With a total implementation cost of $45.20, this system provides an affordable, scalable solution for precision agriculture and smart farming.
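The automated irrigation logic is not spelled out in the abstract, so the following is only a plausible sketch of how moisture-driven pump control with a low-water alert could work. It assumes a hysteresis rule; the `Reading` fields, the function names, and all threshold values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    soil_moisture_pct: float  # soil moisture in percent
    tank_level_cm: float      # water depth in the reservoir, in centimetres


def pump_should_run(reading: Reading, pump_on: bool,
                    dry_pct: float = 35.0, wet_pct: float = 55.0,
                    min_tank_cm: float = 5.0) -> bool:
    """Hysteresis rule (assumed): start below dry_pct, stop above wet_pct.

    Between the two thresholds the pump keeps its current state, which
    avoids rapid on/off switching around a single set point.
    """
    if reading.tank_level_cm < min_tank_cm:
        return False  # never run the pump when the reservoir is nearly empty
    if reading.soil_moisture_pct < dry_pct:
        return True
    if reading.soil_moisture_pct > wet_pct:
        return False
    return pump_on


def needs_alert(reading: Reading, min_tank_cm: float = 5.0) -> bool:
    """Raise an alert (e.g. buzzer or cloud notification) on low water."""
    return reading.tank_level_cm < min_tank_cm


# One control step with a dry soil reading and a full reservoir.
state = pump_should_run(Reading(soil_moisture_pct=30.0, tank_level_cm=20.0),
                        pump_on=False)
```

On a real device the same decision would run in the sensor loop, with the readings also forwarded to the cloud dashboard.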
[126] TinySense: Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing
Toan Gian, Dung T. Tran, Viet Quoc Pham, Francesco Restuccia, Van-Dinh Nguyen
Main category: cs.CV
TL;DR: TinySense: A VQGAN-based compression framework for Wi-Fi human pose estimation that reduces CSI data size while maintaining accuracy, with dynamic bitrate adjustment and Transformer enhancement for robustness.
Details
Motivation: Wi-Fi sensing offers device-free, privacy-preserving human pose estimation but processes large CSI data that strains networking resources. Need efficient compression to enhance scalability.
Method: Uses VQGAN-learned codebook to compress CSI data. Employs K-means to dynamically adjust compression bitrates by clustering pre-trained codebook into subsets. Incorporates Transformer to mitigate bitrate loss in unreliable networks. Prototyped on Jetson Nano and Raspberry Pi.
Result: Achieves up to 1.5x higher HPE accuracy (PCK20) at same compression rate. Reduces latency by up to 5x and networking overhead by up to 2.5x compared to state-of-the-art compression schemes.
Conclusion: TinySense provides efficient CSI compression for Wi-Fi-based human sensing, significantly improving scalability while maintaining accuracy and reducing resource consumption.
Abstract: With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to dynamically adjust compression bitrates by clustering a large-scale pre-trained codebook into smaller subsets. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead by up to 5x and 2.5x, respectively. The code repository is available online.
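The codebook idea behind this kind of compression can be sketched generically: each feature vector is replaced by the index of its nearest codeword, and a smaller codebook needs fewer bits per index. This is plain vector quantization, not the TinySense implementation; the toy codebooks and all function names are assumptions made for illustration, and the reduced codebook merely stands in for the result of clustering the larger one.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each vector to the index of its nearest codeword."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


def decode(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Reconstruct vectors by looking the indices up in the codebook."""
    return [list(codebook[k]) for k in indices]


def bits_per_index(codebook_size: int) -> int:
    """Bits needed to transmit one index with a fixed-length code."""
    return max(1, math.ceil(math.log2(codebook_size)))


# Toy data: a 4-entry codebook and a 2-entry subset standing in for
# centroids obtained by clustering the larger codebook.
full_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
small_codebook = [[0.5, 0.5], [4.5, 4.5]]
features = [[0.1, -0.1], [4.2, 3.9], [0.9, 1.2]]

full_indices = encode(features, full_codebook)    # [0, 2, 1]
small_indices = encode(features, small_codebook)  # [0, 1, 0]
```

Switching from the full codebook to the subset halves the bits per index in this toy case (2 bits to 1 bit) at the cost of a coarser reconstruction, which is the trade-off that bitrate adaptation exploits.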
[127] A Lightweight Brain-Inspired Machine Learning Framework for Coronary Angiography: Hybrid Neural Representation and Robust Learning Strategies
Jingsong Xia, Siqi Wang
Main category: cs.CV
TL;DR: A brain-inspired lightweight deep learning framework for coronary angiography classification that addresses real-world challenges like class imbalance, label uncertainty, and limited computational resources through selective neural plasticity and attention-modulated loss functions.
Details
Motivation: Real-world coronary angiography images present challenges including complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, which hinder conventional deep learning approaches' robustness and generalization in clinical settings.
Method: Uses pretrained CNN for lightweight hybrid neural representation, selective neural plasticity training for efficient parameter adaptation, brain-inspired attention-modulated loss (Focal Loss + label smoothing), class-imbalance-aware sampling, and cosine annealing with warm restarts to mimic biological neural regulation.
Result: The brain-inspired lightweight model achieves strong and stable performance in binary coronary angiography classification with competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency.
Conclusion: Validates brain-inspired learning mechanisms for lightweight medical image analysis and provides a biologically plausible, deployable solution for intelligent clinical decision support under limited computational resources.
Abstract: Background: Coronary angiography (CAG) is a cornerstone imaging modality for assessing coronary artery disease and guiding interventional treatment decisions. However, in real-world clinical settings, angiographic images are often characterized by complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, posing substantial challenges to conventional deep learning approaches in terms of robustness and generalization. Methods: The proposed framework is built upon a pretrained convolutional neural network to construct a lightweight hybrid neural representation. A selective neural plasticity training strategy is introduced to enable efficient parameter adaptation. Furthermore, a brain-inspired attention-modulated loss function, combining Focal Loss with label smoothing, is employed to enhance sensitivity to hard samples and uncertain annotations. Class-imbalance-aware sampling and cosine annealing with warm restarts are adopted to mimic rhythmic regulation and attention allocation mechanisms observed in biological neural systems. Results: Experimental results demonstrate that the proposed lightweight brain-inspired model achieves strong and stable performance in binary coronary angiography classification, yielding competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency. Conclusion: This study validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible and deployable solution for intelligent clinical decision support under limited computational resources.
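Two of the ingredients named above, a focal loss combined with label smoothing and a cosine learning-rate schedule with warm restarts, can be sketched in a few lines. The abstract does not say how the two loss terms are combined, so this assumes the binary focal loss is simply evaluated against smoothed targets; the hyperparameters `gamma`, `eps`, `period`, `lr_max`, and `lr_min` are illustrative defaults, not values from the paper.

```python
import math


def smoothed_focal_loss(p: float, y: int, gamma: float = 2.0,
                        eps: float = 0.1) -> float:
    """Binary focal loss evaluated against a label-smoothed target.

    p: predicted probability of the positive class.
    y: hard label, 0 or 1.
    Assumed combination: the target is softened to y*(1-eps) + eps/2 and
    each cross-entropy term is scaled by the focal factor (1 - p_t)**gamma.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # keep log() finite
    y_soft = y * (1.0 - eps) + 0.5 * eps
    pos_term = -y_soft * (1.0 - p) ** gamma * math.log(p)
    neg_term = -(1.0 - y_soft) * p ** gamma * math.log(1.0 - p)
    return pos_term + neg_term


def cosine_warm_restart_lr(epoch: int, period: int = 10,
                           lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    """Cosine annealing that restarts every `period` epochs (fixed period)."""
    t_cur = epoch % period
    cosine = math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# Learning rate over two cycles: decays within a cycle, jumps back at epoch 10.
schedule = [cosine_warm_restart_lr(epoch) for epoch in range(20)]
```

With `eps=0` and `gamma=0` the loss reduces to ordinary binary cross-entropy, which makes it easy to check the effect of each component separately.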
[128] Out-of-Distribution Detection Based on Total Variation Estimation
Dabiao Ma, Zhiba Su, Jian Yang, Haojun Fei
Main category: cs.CV
TL;DR: TV-OOD is a novel OOD detection method that uses Total Variation Network Estimator to calculate each input’s contribution to total variation as a score for discriminating between in-distribution and out-of-distribution data.
Details
Motivation: To improve security of ML model deployments against distribution shifts in practical applications by developing a more effective out-of-distribution detection method than existing approaches.
Method: Leverages Total Variation Network Estimator to calculate each input’s contribution to overall total variation, defining this as the total variation score for OOD discrimination.
Result: Tested across various models and datasets, consistently yielding results in image classification tasks that were comparable or superior to state-of-the-art OOD detection techniques across all evaluation metrics.
Conclusion: TV-OOD represents an effective approach for OOD detection that outperforms or matches existing methods, enhancing ML model security against distribution shifts in real-world deployments.
Abstract: This paper introduces a novel approach to securing machine learning model deployments against potential distribution shifts in practical applications, the Total Variation Out-of-Distribution (TV-OOD) detection method. Existing methods have produced satisfactory results, but TV-OOD improves upon these by leveraging the Total Variation Network Estimator to calculate each input’s contribution to the overall total variation. By defining this as the total variation score, TV-OOD discriminates between in- and out-of-distribution data. The method’s efficacy was tested across a range of models and datasets, consistently yielding results in image classification tasks that were either comparable or superior to those achieved by leading-edge out-of-distribution detection techniques across all evaluation metrics.
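To make the notion of a per-input contribution to total variation concrete, the sketch below uses the textbook total variation distance between two discrete distributions, TV(P, Q) = 0.5 * sum_i |p_i - q_i|, where each outcome contributes 0.5 * |p_i - q_i|. This is only the classical definition on a toy example; it is not the Total Variation Network Estimator, whose details the abstract does not give, and the function names are hypothetical.

```python
from typing import List, Sequence


def tv_contributions(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Per-outcome contribution 0.5 * |p_i - q_i| to the TV distance."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    return [0.5 * abs(pi - qi) for pi, qi in zip(p, q)]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return sum(tv_contributions(p, q))


# Toy example: the third outcome is where the two distributions differ most,
# so it carries the largest contribution.
in_dist = [0.5, 0.5, 0.0]
other = [0.25, 0.25, 0.5]
scores = tv_contributions(in_dist, other)  # [0.125, 0.125, 0.25]
distance = total_variation(in_dist, other)  # 0.5
```

An outcome with a large contribution is one where the two distributions disagree strongly, which is the intuition behind using such a quantity as a discrimination score.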
[129] PMPBench: A Paired Multi-Modal Pan-Cancer Benchmark for Medical Image Synthesis
Yifan Chen, Fei Yin, Hao Chen, Jia Wu, Chao Li
Main category: cs.CV
TL;DR: Introduces PMPBench: first public, fully paired pan-cancer medical imaging dataset spanning 11 organs with complete DCE sequences and CT/CTC pairs for contrast synthesis research.
Details
Motivation: Contrast medium is essential for radiological imaging but not always feasible due to patient health or resource constraints. Existing datasets are limited to brain MR, have partial pairing, missing modalities, poor alignment, and lack phase labeling, with most resources remaining private.
Method: Created a comprehensive public dataset with complete dynamic contrast-enhanced (DCE) sequences (all three phases) for MR and paired non-contrast/contrast-enhanced CT (CTC) across 11 human organs. Curated for anatomical correspondence to enable rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings.
Result: Established a comprehensive benchmark with results from representative baselines of contemporary image-to-image translation. Released dataset and benchmark to catalyze research on safe, effective contrast synthesis for multi-organ oncology imaging.
Conclusion: PMPBench bridges the data gap in contrast synthesis research by providing the first public, fully paired pan-cancer imaging dataset, enabling rigorous evaluation and advancing AI-based contrast synthesis for safer clinical workflows.
Abstract: Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient’s health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation. We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at https://github.com/YifanChen02/PMPBench.
[130] Understanding the Transfer Limits of Vision Foundation Models
Shiqi Huang, Yipei Wang, Natasha Thorley, Alexander Ng, Shaheer Saeed, Mark Emberton, Shonit Punwani, Veeru Kasivisvanathan, Dean Barratt, Daniel Alexander, Yipeng Hu
Main category: cs.CV
TL;DR: VFMs show uneven performance across tasks due to pretraining-downstream task misalignment. Study on prostate MRI tasks shows better alignment correlates with better transfer performance.
Details
Motivation: Vision foundation models (VFMs) show inconsistent improvements across downstream tasks despite heavy computational investment, likely due to mismatch between pretraining objectives and downstream task requirements.
Method: Assessed two VFMs (MAE-based ProFound and contrastive-learning-based ProViCNet) on five prostate multiparametric MR imaging tasks. Measured task alignment using divergence metrics like MMD between features before/after fine-tuning.
Result: Better alignment between pretraining and downstream tasks correlates with greater performance improvements and faster convergence. Simple divergence metrics can predict transfer performance.
Conclusion: Pretraining objectives should be designed with downstream applicability in mind. Task alignment is crucial for effective transfer learning in vision foundation models.
Abstract: Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
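The alignment measure named in the abstract, maximum mean discrepancy between the same features before and after fine-tuning, can be illustrated with a minimal sketch. The RBF kernel, the bandwidth, the biased estimator, and the toy feature vectors below are all assumptions made for illustration; the abstract does not state which kernel or estimator the authors use.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    # Gaussian (RBF) kernel; the kernel choice and bandwidth are assumptions.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy 2-D "features" (hypothetical values, not data from the paper).
features_before = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
features_small_shift = [[0.1, 0.1], [0.2, 0.1], [0.1, 0.2]]
features_large_shift = [[2.0, 2.1], [2.2, 1.9], [2.1, 2.0]]

small = mmd_squared(features_before, features_small_shift)
large = mmd_squared(features_before, features_large_shift)
print(f"small shift: {small:.4f}, large shift: {large:.4f}")
```

Under this reading, a small value means fine-tuning barely moved the features, which is how the abstract appears to characterize a well-aligned pretraining objective.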
[131] RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture
Anas Anwarul Haq Khan, Mariam Husain, Kshitij Jadhav
Main category: cs.CV
TL;DR: RadJEPA is a self-supervised framework that learns robust radiology encoders without language supervision by predicting latent representations of masked image regions from chest X-rays.
Details
Motivation: Medical vision language models rely on paired image-text data, which is limited. The paper explores whether robust radiology encoders can be learned without language supervision, addressing data availability constraints.
Method: RadJEPA uses a Joint Embedding Predictive Architecture pre-trained solely on unlabeled chest X-ray images. It learns to predict latent representations of masked image regions, focusing on latent-space prediction rather than aligning global representations across views or modalities.
Result: RadJEPA achieves performance exceeding state-of-the-art approaches including Rad-DINO across disease classification, semantic segmentation, and report generation benchmarks.
Conclusion: The framework demonstrates that robust radiology encoders can be effectively learned without language supervision through self-supervised latent-space prediction, outperforming existing methods across multiple medical imaging tasks.
Abstract: Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
[132] ThermoSplat: Cross-Modal 3D Gaussian Splatting with Feature Modulation and Geometry Decoupling
Zhaoqi Su, Shihai Chen, Xinyan Lin, Liqin Huang, Zhipeng Su, Xiaoqiang Lu
Main category: cs.CV
TL;DR: ThermoSplat extends 3D Gaussian Splatting to multi-spectral RGB-Thermal reconstruction using cross-modal feature modulation and adaptive geometry decoupling for improved rendering quality in both spectrums.
Details
Motivation: Multi-modal scene reconstruction with RGB and thermal infrared data is crucial for robust environmental perception across varying lighting and weather conditions, but current 3DGS approaches struggle to effectively leverage complementary information between modalities and handle cross-modal correlations and physical discrepancies.
Method: 1) Cross-Modal FiLM Modulation: Dynamically conditions shared latent features on thermal structural priors to guide visible texture synthesis with cross-modal geometric cues. 2) Modality-Adaptive Geometric Decoupling: Learns independent opacity offsets and executes separate rasterization for thermal branch to handle modality-specific geometric inconsistencies. 3) Hybrid rendering pipeline integrates explicit Spherical Harmonics with implicit neural decoding for semantic consistency and high-frequency detail preservation.
Result: Extensive experiments on RGBT-Scenes dataset demonstrate state-of-the-art rendering quality across both visible and thermal spectrums.
Conclusion: ThermoSplat effectively addresses limitations in multi-spectral 3DGS by enabling deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling, achieving superior performance in RGB-thermal scene reconstruction.
Abstract: Multi-modal scene reconstruction integrating RGB and thermal infrared data is essential for robust environmental perception across diverse lighting and weather conditions. However, extending 3D Gaussian Splatting (3DGS) to multi-spectral scenarios remains challenging. Current approaches often struggle to fully leverage the complementary information of multi-modal data, typically relying on mechanisms that either tend to neglect cross-modal correlations or leverage shared representations that fail to adaptively handle the complex structural correlations and physical discrepancies between spectrums. To address these limitations, we propose ThermoSplat, a novel framework that enables deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling. First, we introduce a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors, effectively guiding visible texture synthesis with reliable cross-modal geometric cues. Second, to accommodate modality-specific geometric inconsistencies, we propose a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and executes an independent rasterization pass for the thermal branch. Additionally, a hybrid rendering pipeline is employed to integrate explicit Spherical Harmonics with implicit neural decoding, ensuring both semantic consistency and high-frequency detail preservation. Extensive experiments on the RGBT-Scenes dataset demonstrate that ThermoSplat achieves state-of-the-art rendering quality across both visible and thermal spectrums.
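FiLM (feature-wise linear modulation) scales and shifts each feature channel using values computed from a conditioning signal. The sketch below shows only that generic operation; the linear maps, their weights, and the vector sizes are hypothetical and are not taken from ThermoSplat.

```python
def linear(weights, bias, vec):
    # Plain affine map: one output per weight row.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(features, condition, gamma_w, gamma_b, beta_w, beta_b):
    """Generic FiLM: out = gamma(condition) * features + beta(condition)."""
    gamma = linear(gamma_w, gamma_b, condition)
    beta = linear(beta_w, beta_b, condition)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

# Hypothetical toy values: a 3-channel shared latent feature and a
# 2-dimensional "thermal prior" used as the conditioning signal.
shared = [1.0, 2.0, 3.0]
thermal_prior = [0.5, -1.0]
gamma_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gamma_b = [1.0, 1.0, 1.0]
beta_w = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
beta_b = [0.0, 0.0, 0.5]

modulated = film(shared, thermal_prior, gamma_w, gamma_b, beta_w, beta_b)
print(modulated)
```

In a learned model the two affine maps would be trained jointly with the rest of the network, so the conditioning signal decides per channel how strongly a shared feature is amplified, suppressed, or offset.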
[133] Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models
Zhen Zhang, Runhao Zeng, Sicheng Zhao, Xiping Hu
Main category: cs.CV
TL;DR: Affective capabilities in multimodal foundation models are primarily mediated by feed-forward gating projections (gate_proj), not attention modules, enabling efficient emotion understanding with minimal parameter tuning.
Details
Motivation: Despite strong empirical performance of affective models, the internal architectural mechanisms supporting emotion understanding and generation in multimodal foundation models remain poorly understood. The paper aims to systematically study how emotion-oriented supervision reshapes model parameters.
Method: Conducted systematic mechanistic study across multiple architectures, training strategies, and affective tasks. Used controlled module transfer, targeted single-module adaptation, and destructive ablation to analyze parameter localization. Specifically investigated the role of gate_proj (feed-forward gating projection) versus attention modules.
Result: Affective adaptation consistently localizes to gate_proj rather than attention modules. The gate_proj module is sufficient, efficient, and necessary for affective understanding and generation. By tuning only ~24.5% of the parameters that AffectGPT tunes, the approach achieves 96.6% of AffectGPT’s average performance across eight affective tasks.
Conclusion: Affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms, with gate_proj identified as the central architectural locus for affective modeling. This provides empirical evidence for specific architectural components responsible for emotion processing.
Abstract: Understanding where and how emotions are represented in large-scale foundation models remains an open problem, particularly in multimodal affective settings. Despite the strong empirical performance of recent affective models, the internal architectural mechanisms that support affective understanding and generation are still poorly understood. In this work, we present a systematic mechanistic study of affective modeling in multimodal foundation models. Across multiple architectures, training strategies, and affective tasks, we analyze how emotion-oriented supervision reshapes internal model parameters. Our results consistently reveal a clear and robust pattern: affective adaptation does not primarily focus on the attention module, but instead localizes to the feed-forward gating projection (gate_proj). Through controlled module transfer, targeted single-module adaptation, and destructive ablation, we further demonstrate that gate_proj is sufficient, efficient, and necessary for affective understanding and generation. Notably, by tuning only approximately 24.5% of the parameters tuned by AffectGPT, our approach achieves 96.6% of its average performance across eight affective tasks, highlighting substantial parameter efficiency. Together, these findings provide empirical evidence that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identify gate_proj as a central architectural locus of affective modeling.
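Targeted single-module adaptation amounts to marking only the parameters of one module type as trainable. A minimal sketch of that selection step follows, assuming parameter names of the form layers.N.mlp.gate_proj.weight; the names and sizes are hypothetical, and the resulting fraction is a toy number unrelated to the 24.5% figure reported in the abstract.

```python
def select_trainable(param_sizes, keyword="gate_proj"):
    """Keep only parameters whose name contains the keyword."""
    return {name: size for name, size in param_sizes.items() if keyword in name}

# Hypothetical parameter names and element counts for one toy layer.
param_sizes = {
    "layers.0.self_attn.q_proj.weight": 16,
    "layers.0.self_attn.k_proj.weight": 16,
    "layers.0.self_attn.v_proj.weight": 16,
    "layers.0.self_attn.o_proj.weight": 16,
    "layers.0.mlp.gate_proj.weight": 32,
    "layers.0.mlp.up_proj.weight": 32,
    "layers.0.mlp.down_proj.weight": 32,
}

trainable = select_trainable(param_sizes)
trainable_fraction = sum(trainable.values()) / sum(param_sizes.values())
print(sorted(trainable), trainable_fraction)
```

With a PyTorch model, the same filter could be applied by iterating over model.named_parameters() and setting requires_grad to False for every name that does not match, though whether the authors implement it this way is not stated.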
[134] The Latency Wall: Benchmarking Off-the-Shelf Emotion Recognition for Real-Time Virtual Avatars
Yarin Benyamin
Main category: cs.CV
TL;DR: Benchmarking SOTA models for real-time emotion recognition in VR therapy for ASD reveals a “Latency Wall” where general-purpose Transformers fail to meet both accuracy and speed requirements, highlighting need for lightweight domain-specific architectures.
Details
Motivation: Real-time emotion recognition in VR/HCI could help individuals with ASD improve social skills, but requires strict latency-accuracy trade-off (MTP latency <140ms). Most DL models prioritize accuracy over timing constraints on commodity hardware.
Method: Benchmarked SOTA models for Zero-Shot FER on virtual characters using UIBVFED dataset. Evaluated Medium and Nano variants of YOLO (v8, v11, v12) for face detection, and general-purpose Vision Transformers (CLIP, SigLIP, ViT-FER) on CPU-only inference.
Result: Face detection on stylized avatars is robust (100% accuracy). YOLOv11n offers optimal balance for detection (~54 ms). However, general-purpose Transformers fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops, creating a “Latency Wall” in classification stage.
Conclusion: Lightweight, domain-specific architectures are necessary to enable accessible, real-time AI in therapeutic VR settings, as current general-purpose models cannot meet both accuracy and latency requirements simultaneously.
Abstract: In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and ViT-FER. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a “Latency Wall” exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
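The reported figures can be put side by side in a short budget check. This sketch assumes that detection and classification run sequentially and that both must fit inside the 140 ms motion-to-photon budget; the abstract does not spell out how the budget is split, and 150 ms is used here only as the lower bound it reports for the Transformer classifiers.

```python
MTP_BUDGET_MS = 140.0  # motion-to-photon limit quoted in the abstract

def fits_budget(stage_latencies_ms, budget_ms=MTP_BUDGET_MS):
    """Sum sequential stage latencies and compare against the budget."""
    total = sum(stage_latencies_ms.values())
    return total, total <= budget_ms

# Stage names are illustrative labels; latencies are the abstract's figures.
pipeline = {
    "face_detection_yolov11n": 54.0,       # ~54 ms
    "classification_transformer": 150.0,   # lower bound (>150 ms)
}

total_ms, within_budget = fits_budget(pipeline)
remaining_for_classifier_ms = MTP_BUDGET_MS - pipeline["face_detection_yolov11n"]
print(total_ms, within_budget, remaining_for_classifier_ms)
```

Under these assumptions the classifier alone already exceeds the whole budget, and after detection only about 86 ms would remain for it.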
[135] A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery
Valery Fischer, Alan Magdaleno, Anna-Katharina Calek, Nicola Cavalcanti, Nathan Hoffman, Christoph Germann, Joschua Wüthrich, Max Krähenmann, Mazda Farshad, Philipp Fürnstahl, Lilian Calvet
Main category: cs.CV
TL;DR: A robust multi-view pipeline for 3D hand pose estimation in surgery using off-the-shelf models without fine-tuning, plus a new surgical benchmark dataset with 68K+ frames.
Details
Motivation: Surgical applications need accurate 3D hand pose estimation for skill assessment, robot-assisted interventions, and workflow analysis, but face challenges from intense lighting, occlusions, uniform glove appearance, and lack of annotated datasets.
Method: Multi-view pipeline integrating person detection, whole-body pose estimation, 2D hand keypoint prediction on tracked hand crops, and constrained 3D optimization using only off-the-shelf pretrained models without domain-specific fine-tuning.
Result: Method outperforms baselines with 31% reduction in 2D mean joint error and 76% reduction in 3D mean per-joint position error.
Conclusion: Establishes strong baseline for surgical 3D hand pose estimation with training-free pipeline and comprehensive annotated dataset to advance surgical computer vision research.
Abstract: Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
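Mean per-joint position error, the 3D metric quoted in the results, is the average Euclidean distance between predicted and ground-truth joints. The sketch below computes it and a relative reduction on toy inputs; the joint coordinates and error values are hypothetical and are not data from the paper.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance per joint."""
    if len(predicted) != len(ground_truth):
        raise ValueError("joint lists must have the same length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline."""
    return (baseline_error - new_error) / baseline_error

# Two toy joints in arbitrary units (hypothetical values).
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]

error = mpjpe(predicted, ground_truth)
reduction = relative_reduction(20.0, 5.0)
print(error, reduction)
```

A reported "76% reduction" corresponds to relative_reduction returning 0.76 for the baseline and proposed errors.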
[136] Class Confidence Aware Reweighting for Long Tailed Learning
Brainard Philemon Jagati, Jitendra Tembhurne, Harsh Goud, Rudra Pratap Singh, Chandrashekhar Meshram
Main category: cs.CV
TL;DR: The paper proposes a class and confidence-aware re-weighting scheme for long-tailed learning that modulates training contributions based on prediction confidence and class frequency, complementing existing logit adjustment methods.
Details
Motivation: Deep neural networks degrade significantly in long-tailed data distributions where head classes dominate training data while tail classes have few examples. Existing methods focus mainly on logit-level adjustments to compensate for class-prior bias, with insufficient attention to optimization process adjustments based on sample confidence differences.
Method: Design of a class and confidence-aware re-weighting scheme operating purely at the loss level. Uses an Ω(p_t, f_c) function to modulate training contributions based on prediction confidence values and relative class frequencies. This approach complements existing logit adjustment methods.
Result: Significant experimental results on CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors validate the theoretical discussions and demonstrate the effectiveness of the proposed approach.
Conclusion: The proposed confidence-aware re-weighting scheme effectively addresses long-tailed learning problems by considering both class frequency and prediction confidence, providing a complementary approach to existing logit adjustment methods.
Abstract: Deep neural network models degrade significantly under long-tailed data distributions, where a small set of head classes dominates the training data and the tail classes have far fewer training examples. To address class imbalance, the related literature has focused mainly on adjustments in the decision space, typically corrections at the logit level to compensate for class-prior bias, and has paid the least attention to how the optimization process is affected by differences in confidence among samples. In the current study, we present the design of a class and confidence-aware re-weighting scheme for long-tailed learning. This scheme operates purely at the loss level and is complementary to existing methods that adjust the logits. In the practical implementation of the proposed scheme, we use an Ω(p_t, f_c) function. This function modulates each sample’s contribution to training based on the confidence of the prediction and the relative frequency of the corresponding class. Significant experimental results on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors corroborate the theoretical discussion.
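The abstract names a weighting function Ω(p_t, f_c) but does not give its form. The sketch below therefore uses a purely hypothetical form, a focal-style confidence term multiplied by an inverse-frequency term, only to show how a loss-level weight that depends on both confidence and class frequency behaves. The exponents gamma and alpha are likewise invented for illustration.

```python
import math

def omega(p_t, f_c, gamma=2.0, alpha=0.5):
    # HYPOTHETICAL form, not the paper's definition:
    # down-weight confident predictions, up-weight rare classes.
    return ((1.0 - p_t) ** gamma) * (f_c ** -alpha)

def reweighted_loss(p_t, f_c, gamma=2.0, alpha=0.5):
    """Cross-entropy on the true-class probability, scaled by omega."""
    return omega(p_t, f_c, gamma, alpha) * -math.log(p_t)

# p_t: predicted probability of the true class; f_c: relative class frequency.
head_weight = omega(p_t=0.6, f_c=0.5)
tail_weight = omega(p_t=0.6, f_c=0.01)
confident_weight = omega(p_t=0.9, f_c=0.1)
uncertain_weight = omega(p_t=0.3, f_c=0.1)
print(head_weight, tail_weight, confident_weight, uncertain_weight)
```

Because the weight multiplies the loss rather than shifting the logits, a scheme of this kind can be combined with logit adjustment, which matches the complementarity the abstract describes.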
[137] NeuroMamba: Multi-Perspective Feature Interaction with Visual Mamba for Neuron Segmentation
Liuyun Jiang, Yizhuo Lu, Yanchao Zhang, Jiazheng Liu, Hua Han
Main category: cs.CV
TL;DR: NeuroMamba: A multi-perspective framework for neuron segmentation that combines Mamba-based global modeling with local feature extraction to handle irregular neuron morphology and dense structures, achieving SOTA performance on EM datasets.
Details
Motivation: Neuron segmentation is crucial for reconstructing brain connectomes but challenging due to irregular morphology and densely intertwined structures. CNN-based methods lack long-range context, while Transformer-based methods lose voxel-level details during patch partitioning.
Method: Proposes NeuroMamba framework with: 1) Channel-gated Boundary Discriminative Feature Extractor (BDFE) for local morphological cues, 2) Spatial Continuous Feature Extractor (SCFE) with resolution-aware scanning in Visual Mamba architecture for global dependencies, and 3) cross-modulation mechanism to fuse multi-perspective features.
Result: Demonstrates state-of-the-art performance across four public EM datasets, showing exceptional adaptability to both anisotropic and isotropic resolutions.
Conclusion: NeuroMamba effectively addresses limitations of existing methods by combining patch-free global modeling with local feature extraction, enabling efficient capture of long-range dependencies while preserving fine-grained voxel details for neuron segmentation.
Abstract: Neuron segmentation is the cornerstone of reconstructing comprehensive neuronal connectomes, which is essential for deciphering the functional organization of the brain. The irregular morphology and densely intertwined structures of neurons make this task particularly challenging. Prevailing CNN-based methods often fail to resolve ambiguous boundaries due to the lack of long-range context, whereas Transformer-based methods suffer from boundary imprecision caused by the loss of voxel-level details during patch partitioning. To address these limitations, we propose NeuroMamba, a multi-perspective framework that exploits the linear complexity of Mamba to enable patch-free global modeling and synergizes this with complementary local feature modeling, thereby efficiently capturing long-range dependencies while meticulously preserving fine-grained voxel details. Specifically, we design a channel-gated Boundary Discriminative Feature Extractor (BDFE) to enhance local morphological cues. Complementing this, we introduce the Spatial Continuous Feature Extractor (SCFE), which integrates a resolution-aware scanning mechanism into the Visual Mamba architecture to adaptively model global dependencies across varying data resolutions. Finally, a cross-modulation mechanism synergistically fuses these multi-perspective features. Our method demonstrates state-of-the-art performance across four public EM datasets, validating its exceptional adaptability to both anisotropic and isotropic resolutions. The source code will be made publicly available.
[138] EVolSplat4D: Efficient Volume-based Gaussian Splatting for 4D Urban Scene Synthesis
Sheng Miao, Sijin Li, Pan Wang, Dongfeng Bai, Bingbing Liu, Yue Wang, Andreas Geiger, Yiyi Liao
Main category: cs.CV
TL;DR: EvolSplat4D is a feed-forward framework for novel view synthesis of urban scenes that unifies volume-based and pixel-based Gaussian prediction across three specialized branches for static close-range, dynamic actors, and far-field regions.
Details
Motivation: Existing methods struggle to balance reconstruction time with quality. Neural radiance fields and 3D Gaussian Splatting achieve photorealism but require time-consuming per-scene optimization, while feed-forward methods using per-pixel Gaussian representations suffer from 3D inconsistencies in complex dynamic environments.
Method: A three-branch framework: 1) For close-range static regions: predict consistent 3D Gaussian geometry from 3D feature volumes with semantically-enhanced image-based rendering for appearance; 2) For dynamic actors: use object-centric canonical spaces and motion-adjusted rendering to aggregate temporal features; 3) For far-field scenery: efficient per-pixel Gaussian branch for full-scene coverage.
Result: Experimental results on KITTI-360, KITTI, Waymo, and PandaSet datasets show EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
Conclusion: EvolSplat4D successfully addresses the trade-off between reconstruction time and quality in urban scene novel view synthesis by moving beyond per-pixel paradigms through a unified volume-based and pixel-based Gaussian prediction framework with specialized branches for different scene components.
Abstract: Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
[139] HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models
Xin Xie, Jiaxian Guo, Dong Gong
Main category: cs.CV
TL;DR: HyperAlign: A hypernetwork-based framework for efficient test-time alignment of diffusion models that dynamically generates low-rank adaptation weights to improve semantic consistency and visual appeal without reward over-optimization or computational overhead.
Details
Motivation: Diffusion models often generate outputs misaligned with human preferences (poor aesthetics, semantic inconsistencies). Existing alignment methods face trade-offs: fine-tuning loses diversity through reward over-optimization, while test-time scaling has high computational cost and under-optimizes.
Method: HyperAlign trains a hypernetwork to generate low-rank adaptation weights that dynamically modulate diffusion model’s generation operators based on input latents, timesteps, and prompts. Multiple variants balance performance/efficiency by varying hypernetwork application frequency. Optimized with reward score objective regularized with preference data to prevent reward hacking.
Result: Significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal across multiple generative paradigms including Stable Diffusion and FLUX.
Conclusion: HyperAlign provides an effective and efficient solution for test-time alignment of diffusion models, addressing limitations of existing methods by avoiding reward over-optimization while maintaining computational efficiency and preventing reward hacking.
Abstract: Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model’s generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
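Low-rank adaptation replaces a weight matrix W with W + BA, where B and A are thin matrices. The sketch below shows that update with the two factors produced by a stand-in function of a conditioning vector. The toy_hypernetwork function, its formula, and all values are hypothetical; in HyperAlign the hypernetwork is a trained model conditioned on latents, timesteps, and prompts, and its architecture is not described in the abstract.

```python
def matmul(a, b):
    # Naive matrix product of nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def toy_hypernetwork(condition, out_dim, in_dim, rank):
    # HYPOTHETICAL stand-in for a learned hypernetwork: maps a conditioning
    # vector to low-rank factors B (out_dim x rank) and A (rank x in_dim).
    scale = sum(condition) / len(condition)
    b = [[scale for _ in range(rank)] for _ in range(out_dim)]
    a = [[0.1 * (j + 1) for j in range(in_dim)] for _ in range(rank)]
    return b, a

def adapted_weight(base, condition, rank=1):
    """Return base + B @ A with factors generated from the condition."""
    b, a = toy_hypernetwork(condition, len(base), len(base[0]), rank)
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]
condition = [0.5, 1.5]  # stand-in for latent, timestep, and prompt features
adapted = adapted_weight(base, condition)
print(adapted)
```

Since the factors depend on the condition, the effective weights can differ at each denoising step, and applying the generator less often is one way to trade quality for speed, as the variants mentioned in the abstract do.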
[140] PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models
Chak-Wing Mak, Guanyu Zhu, Boyi Zhang, Hongji Li, Xiaowei Chi, Kevin Zhang, Yichen Wu, Yangfan He, Chun-Kai Fan, Wentao Lu, Kuangzhi Ge, Xinyu Fang, Hongyang He, Kuan Lu, Tianxiang Xu, Li Zhang, Yongxin Ni, Youhua Li, Shanghang Zhang
Main category: cs.CV
TL;DR: PhysicsMind is a new benchmark that evaluates multimodal models’ understanding of physical laws through reasoning and generation tasks across real and simulated environments, focusing on center of mass, lever equilibrium, and Newton’s First Law.
Details
Motivation: Current MLLMs and video world models have advanced in various reasoning tasks but lack proper evaluation of their grasp of underlying physics. Existing benchmarks use synthetic templates or focus on perceptual quality rather than measuring adherence to physical laws.
Method: Introduces PhysicsMind benchmark with real and simulation environments evaluating three physical principles: Center of Mass, Lever Equilibrium, and Newton’s First Law. Includes two main tasks: 1) VQA tasks testing physical quantity reasoning from images/videos, and 2) Video Generation tasks evaluating if predicted motion trajectories obey physical constraints.
Result: Evaluation of recent models shows they rely on appearance heuristics and often violate basic mechanics. Current scaling and training are insufficient for robust physical understanding, highlighting gaps in multimodal models’ physics comprehension.
Conclusion: PhysicsMind serves as a focused testbed for physics-aware multimodal models, revealing significant limitations in current models’ physical reasoning capabilities and providing a unified benchmark for future development.
Abstract: Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure this rely on synthetic Visual Question Answering (VQA) templates or focus on perceptual video quality, which is tangential to how well a video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton’s First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation (VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
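Two of the benchmark's principles reduce to short calculations: a lever is in equilibrium when the net torque about the pivot is zero, and the center of mass is the mass-weighted mean position. The sketch below works these out for point masses on a one-dimensional beam. The sign convention (positions measured from the pivot, positive to the right) and the toy loads are assumptions, and this is not the benchmark's evaluation code.

```python
def net_torque(loads, g=9.81):
    """Net torque about the pivot for (mass_kg, position_m) point loads."""
    return sum(mass * g * position for mass, position in loads)

def center_of_mass(loads):
    """Mass-weighted mean position of the loads."""
    total_mass = sum(mass for mass, _ in loads)
    return sum(mass * position for mass, position in loads) / total_mass

# Positions are signed distances from the pivot (hypothetical values).
balanced = [(2.0, -1.5), (3.0, 1.0)]    # 2 kg * 1.5 m equals 3 kg * 1.0 m
unbalanced = [(2.0, -1.0), (3.0, 1.0)]  # right side has the larger moment

print(net_torque(balanced), center_of_mass(balanced))
print(net_torque(unbalanced), center_of_mass(unbalanced))
```

The two checks agree by construction: the net torque vanishes exactly when the center of mass sits over the pivot, which is the kind of consistency a law-abiding prediction should show.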
[141] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu
Main category: cs.CV
TL;DR: EDIR is a new fine-grained Composed Image Retrieval benchmark created using image editing to generate diverse queries across 5 categories and 15 subcategories, revealing significant gaps in current multimodal models.
Details
Motivation: Current CIR benchmarks have limited query categories and fail to capture real-world diversity, creating an evaluation gap that needs to be addressed.
Method: Used image editing for precise control over modification types and content to synthesize diverse queries, constructing EDIR with 5,000 high-quality queries across 5 main categories and 15 subcategories.
Result: Evaluation of 13 multimodal embedding models shows significant capability gaps - even state-of-the-art models struggle across all subcategories. The benchmark reveals modality biases and insufficient categorical coverage in existing benchmarks.
Conclusion: EDIR provides a rigorous benchmark that exposes limitations in current CIR models and distinguishes between solvable categories (with targeted data) and those revealing intrinsic architectural limitations.
Abstract: Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.
[142] Keyframe-Based Feed-Forward Visual Odometry
Weichen Dai, Wenhan Su, Da Kong, Yuhang Ming, Wanzeng Kong
Main category: cs.CV
TL;DR: Proposes a keyframe-based feed-forward visual odometry method using reinforcement learning to adaptively select keyframes, improving efficiency and accuracy over current foundation model approaches.
Details
Motivation: Current visual foundation model based VO methods process all frames indiscriminately, causing computational redundancy and degraded performance due to low inter-frame parallax. Traditional keyframe heuristics don't work well with these models since they rely on high-dimensional latent representations rather than explicit geometric metrics.
Method: Uses reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. Instead of hand-crafted rules, learns optimal keyframe selection strategy.
Result: Trained on TartanAir dataset and evaluated across several real-world datasets. Achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
Conclusion: The proposed reinforcement learning-based keyframe selection method effectively bridges the gap between traditional geometric heuristics and modern foundation model based VO, improving both efficiency and accuracy.
Abstract: The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation model based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
[143] PAINT: Pathology-Aware Integrated Next-Scale Transformation for Virtual Immunohistochemistry
Rongze Ma, Mengkang Lu, Zhenyu Xiang, Yongsheng Pan, Yicheng Wu, Qingjie Zeng, Yong Xia
Main category: cs.CV
TL;DR: PAINT is a structure-first autoregressive framework for virtual immunohistochemistry that synthesizes molecular staining from H&E images by conditioning on global structural layouts rather than direct image translation.
Details
Motivation: Virtual IHC offers cost-effective alternatives to physical staining, but current methods struggle with semantic inconsistencies due to ambiguous cues in H&E morphology and insufficient structural priors.
Method: PAINT reformulates synthesis as structure-first conditional generation using a Spatial Structural Start Map (3S-Map) to ground autoregressive initialization in observed morphology, enforcing causal order by resolving molecular details conditioned on global structural layout.
Result: PAINT outperforms state-of-the-art methods on IHC4BC and MIST datasets in structural fidelity and clinical downstream tasks.
Conclusion: The structure-guided autoregressive modeling approach validates the potential of prioritizing structural consistency in virtual IHC synthesis.
Abstract: Virtual immunohistochemistry (IHC) aims to computationally synthesize molecular staining patterns from routine Hematoxylin and Eosin (H&E) images, offering a cost-effective and tissue-efficient alternative to traditional physical staining. However, this task is particularly challenging: H&E morphology provides ambiguous cues about protein expression, and similar tissue structures may correspond to distinct molecular states. Most existing methods focus on direct appearance synthesis to implicitly achieve cross-modal generation, often resulting in semantic inconsistencies due to insufficient structural priors. In this paper, we propose Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that reformulates the synthesis process as a structure-first conditional generation task. Unlike direct image translation, PAINT enforces a causal order by resolving molecular details conditioned on a global structural layout. Central to this approach is the introduction of a Spatial Structural Start Map (3S-Map), which grounds the autoregressive initialization in observed morphology, ensuring deterministic, spatially aligned synthesis. Experiments on the IHC4BC and MIST datasets demonstrate that PAINT outperforms state-of-the-art methods in structural fidelity and clinical downstream tasks, validating the potential of structure-guided autoregressive modeling.
[144] ProGiDiff: Prompt-Guided Diffusion-Based Medical Image Segmentation
Yuan Lin, Murong Xu, Marc Hölle, Chinmay Prabhakar, Andreas Maier, Vasileios Belagiannis, Bjoern Menze, Suprosanna Shit
Main category: cs.CV
TL;DR: ProGiDiff: A novel framework that adapts pre-trained diffusion models for medical image segmentation using ControlNet-style conditioning and natural language prompts, enabling multi-class segmentation and cross-modality adaptation.
Details
Motivation: Current medical image segmentation methods are deterministic, lack natural language prompt capability, cannot estimate multiple proposals, and have limited human interaction and cross-modality adaptation. Text-to-image diffusion models show potential but require large datasets for training and are limited to binary segmentation.
Method: ProGiDiff leverages existing image generation models with a ControlNet-style conditioning mechanism and custom encoder for image conditioning. It steers pre-trained diffusion models to output segmentation masks and naturally extends to multi-class segmentation by prompting target organs.
Result: Experiments on CT organ segmentation show strong performance compared to previous methods. The framework benefits from expert-in-the-loop settings to leverage multiple proposals. The conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
Conclusion: ProGiDiff successfully bridges the gap between deterministic segmentation methods and flexible, prompt-based approaches by adapting pre-trained diffusion models, enabling multi-class segmentation, human interaction, and cross-modality adaptation with minimal training data requirements.
Abstract: Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. Thus, they lack the capability to estimate multiple proposals, human interaction, and cross-modality adaptation. Recently, text-to-image diffusion models have shown potential to bridge the gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation purposes. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting the target organ. Our experiment on organ segmentation from CT images demonstrates strong performance compared to previous methods and could greatly benefit from an expert-in-the-loop setting to leverage multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
[145] DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang
Main category: cs.CV
TL;DR: A plug-and-play Distracting Token Pruning (DTP) framework that dynamically detects and prunes distracting image tokens in Vision-Language Action models to improve task success rates without modifying model architecture.
Details
Motivation: VLA models often overly attend to task-irrelevant image regions (distracting tokens), which disturbs action token generation and reduces task success rates. The authors aim to correct visual attention patterns to improve performance.
Method: Proposes Distracting Token Pruning (DTP) framework that dynamically identifies and prunes distracting image tokens during inference. The method is plug-and-play, requiring no architecture changes or additional inputs.
Result: Experiments on SIMPLER Benchmark show consistent relative improvements in task success rates across different VLA models. Analysis reveals negative correlation between task success rate and attention to task-irrelevant regions.
Conclusion: DTP effectively improves VLA model performance by pruning distracting tokens, demonstrating generalizability across transformer-based VLAs. The identified attention pattern issue provides guidance for future VLA research.
Abstract: Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as ‘distracting tokens’. This behavior can distract the model from generating the desired action tokens in each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model’s visual attention patterns, we aim to improve the task success rate and to explore the performance upper bounds of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
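The pruning idea summarized above can be illustrated with a minimal sketch. The summary does not say how DTP detects distracting tokens, so everything here is an assumption made for illustration: the function name, the `relevant` mask marking task-relevant tokens, and the `threshold` cutoff are hypothetical and are not taken from the paper or its code. The sketch only shows the general shape of "drop attention on flagged image tokens, then renormalize".

```python
def prune_distracting_tokens(attention, relevant, threshold=0.1):
    """Zero out attention on tokens flagged as distracting, then renormalize.

    attention: attention weights over image tokens (expected to sum to 1).
    relevant:  booleans, True where a token lies in the task-relevant region
               (hypothetical input; how DTP finds this region is not stated).
    threshold: hypothetical cutoff; task-irrelevant tokens whose attention
               exceeds it are treated as distracting and pruned.
    Returns (pruned_attention, keep_mask).
    """
    if len(attention) != len(relevant):
        raise ValueError("attention and relevant must have the same length")

    # Keep every task-relevant token, plus irrelevant tokens with low attention.
    keep = [rel or a <= threshold for a, rel in zip(attention, relevant)]

    kept_mass = sum(a for a, k in zip(attention, keep) if k)
    if kept_mass == 0:
        # Nothing left to renormalize onto; return the input unchanged.
        return list(attention), keep

    pruned = [a / kept_mass if k else 0.0 for a, k in zip(attention, keep)]
    return pruned, keep


if __name__ == "__main__":
    weights = [0.5, 0.3, 0.15, 0.05]
    in_task_region = [False, True, True, False]
    pruned, keep = prune_distracting_tokens(weights, in_task_region)
    print(keep)
    print(pruned)
```

In this toy input the first token draws half of the attention while lying outside the task region, so it is the only one removed; the remaining weights are rescaled to sum to 1.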
[146] DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models
Hanwen Zhang, Qiaojin Shen, Yuxi Liu, Yuesheng Zhu, Guibo Luo
Main category: cs.CV
TL;DR: DSFedMed is a dual-scale federated framework for medical image segmentation that enables mutual knowledge distillation between a centralized foundation model and lightweight client models, achieving better performance with 90% reduction in communication and inference costs.
Details
Motivation: Foundation models have strong generalization but face deployment challenges in federated settings due to high computational demands, communication overhead, and inference costs. There's a need for efficient federated learning solutions for medical image segmentation.
Method: Proposes DSFedMed with dual-scale mutual knowledge distillation between centralized foundation model and lightweight client models. Uses generated high-quality medical images instead of real public datasets, and implements learnability-guided sample selection for efficient distillation.
Result: Achieves average 2% improvement in Dice score across five medical imaging segmentation datasets while reducing communication costs and inference time by nearly 90% compared to existing federated foundation model baselines.
Conclusion: DSFedMed demonstrates significant efficiency gains and scalability for resource-limited federated deployments, enabling practical use of foundation models in federated medical imaging settings.
Abstract: Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
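For readers unfamiliar with the metric behind the reported 2 percent gain, the Dice score for binary masks is 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flat binary masks. It is a generic illustration, not the paper's evaluation code; returning 1.0 when both masks are empty is a convention chosen here, and other implementations handle that case differently.

```python
def dice_score(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for flat binary masks.

    pred, target: equal-length sequences of 0/1 (or False/True) values.
    Returns 1.0 when both masks are empty (a convention assumed here).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")

    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    # One overlapping pixel, two foreground pixels in each mask.
    print(dice_score(predicted, ground_truth))  # 0.5
```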
[147] Masked Modeling for Human Motion Recovery Under Occlusions
Zhiyin Qian, Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, Siyu Tang
Main category: cs.CV
TL;DR: MoRo is a real-time human motion reconstruction framework that uses masked modeling to handle occlusions in monocular videos, achieving 70 FPS performance while outperforming state-of-the-art methods.
Details
Motivation: Existing methods for human motion reconstruction from monocular videos struggle with real-world occlusions - regression methods are fragile to missing observations, while optimization/diffusion approaches are too slow for real-time applications.
Method: MoRo uses masked modeling for occlusion-robust motion recovery, with a cross-modality learning scheme combining: 1) trajectory-aware motion prior from MoCap data, 2) image-conditioned pose prior from image-pose data, and 3) video-conditioned masked transformer fine-tuned on video-motion data.
Result: MoRo substantially outperforms state-of-the-art methods on EgoBody and RICH datasets in accuracy and motion realism under occlusions, performs on-par in non-occluded scenarios, and achieves real-time inference at 70 FPS on a single H200 GPU.
Conclusion: MoRo provides an efficient, end-to-end solution for robust human motion reconstruction from monocular videos that handles occlusions effectively while maintaining real-time performance, addressing key limitations of existing approaches.
Abstract: Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
[148] SAMTok: Representing Any Mask with Two Words
Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
Main category: cs.CV
TL;DR: SAMTok converts region masks into discrete tokens, enabling MLLMs to learn pixel-wise capabilities through standard language modeling without architectural changes.
Details
Motivation: Pixel-wise capabilities are essential for interactive intelligent systems, but current pixel-wise MLLMs are difficult to scale due to complex region encoders, specialized segmentation decoders, and incompatible training objectives.
Method: SAMTok is a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs masks with high fidelity. It builds on SAM2 and is trained on 209M diverse masks using a mask encoder and a residual vector quantizer. It enables MLLMs to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning without architectural modifications.
Result: QwenVL-SAMTok achieves state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. Reinforcement learning with textual answer-matching reward delivers substantial improvements on GRES and GCG benchmarks.
Conclusion: SAMTok demonstrates a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities, treating masks as language tokens to avoid complex architectural changes and specialized loss designs.
Abstract: Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
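The abstract mentions a residual vector quantizer that yields a compact pair of tokens per mask. As a rough illustration of how residual quantization can turn one embedding into two discrete indices, here is a minimal sketch. Treating the two tokens as two quantization stages is an assumption made for this example, and the tiny hand-written codebooks and all function names are hypothetical; SAMTok's actual encoder, codebook sizes, and training are not described in this summary.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Return one index per codebook.

    Each stage quantizes the residual left over by the previous stage,
    so later indices refine the approximation made by earlier ones.
    """
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    # Hypothetical toy codebooks: a coarse stage and a fine stage.
    coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
    tokens = rvq_encode([1.2, 0.8], [coarse, fine])
    print(tokens)                            # two indices, one per stage
    print(rvq_decode(tokens, [coarse, fine]))
```

With two stages, the pair of indices plays the role of the "two words": the first picks a coarse code and the second corrects its residual error.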
[149] Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification
Zack Dewis, Yimin Zhu, Zhengsen Xu, Mabel Heffring, Saeid Taleghanidoozdoozan, Quinn Ledingham, Lincoln Linlin Xu
Main category: cs.CV
TL;DR: CSSMamba is a clustering-guided spatial-spectral Mamba framework for hyperspectral image classification that integrates clustering mechanisms with Mamba architecture to create efficient adaptive token sequences and improve feature learning.
Details
Motivation: Current Mamba models for HSI classification face challenges in defining efficient and adaptive token sequences for improved performance. There's a need to better integrate spatial and spectral information while optimizing token sequencing.
Method: 1) Cluster-guided spatial Mamba module (CSpaMamba) integrates clustering into spatial Mamba to reduce sequence length and improve feature learning. 2) Combines CSpaMamba with spectral Mamba module (SpeMamba) for spatial-spectral learning. 3) Attention-Driven Token Selection mechanism optimizes Mamba token sequencing. 4) Learnable Clustering Module adaptively learns cluster memberships.
Result: Experiments on Pavia University, Indian Pines, and Liao-Ning 01 datasets show CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.
Conclusion: CSSMamba effectively addresses token sequence challenges in Mamba models for HSI classification through clustering integration, achieving superior performance in both accuracy and boundary preservation.
Abstract: Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in terms of defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to better address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate the clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves Mamba feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to integrate clustering into the Mamba model in a coherent manner, we design a Learnable Clustering Module that learns the cluster memberships in an adaptive manner. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.
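One way to picture how clustering can shorten a token sequence is to merge all tokens that share a cluster into a single representative token. The sketch below does this by mean pooling. It is an assumption made for illustration only: the summary does not state how CSpaMamba uses cluster memberships, and the function name and the fixed `assignments` input (standing in for the learnable clustering module) are hypothetical.

```python
def cluster_pool(tokens, assignments, num_clusters):
    """Average the tokens in each cluster, returning one token per non-empty cluster.

    tokens:       non-empty list of equal-length feature vectors.
    assignments:  cluster id (0 .. num_clusters - 1) for each token; fixed here,
                  whereas a learnable module would predict these.
    The output sequence has at most num_clusters entries, ordered by cluster id.
    """
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters

    for token, cluster in zip(tokens, assignments):
        counts[cluster] += 1
        for d in range(dim):
            sums[cluster][d] += token[d]

    return [
        [s / counts[c] for s in sums[c]]
        for c in range(num_clusters)
        if counts[c] > 0
    ]


if __name__ == "__main__":
    pixels = [[1, 2], [3, 4], [10, 10], [20, 30]]
    clusters = [0, 0, 1, 1]
    # Four input tokens collapse to two cluster tokens.
    print(cluster_pool(pixels, clusters, num_clusters=2))
```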
[150] Learning to Watermark in the Latent Space of Generative Models
Sylvestre-Alvise Rebuffi, Tuan Tran, Valeriu Lacatusu, Pierre Fernandez, Tomáš Souček, Nikola Jovanović, Tom Sander, Hady Elsahar, Alexandre Mourachko
Main category: cs.CV
TL;DR: DistSeal introduces latent space watermarking for AI-generated images, achieving competitive robustness with 20x speedup and better imperceptibility compared to pixel-space methods.
Details
Motivation: Existing pixel-space watermarking methods for AI-generated images have computational overhead and cause visual artifacts, creating a need for more efficient and less intrusive watermarking solutions.
Method: Train post-hoc watermarking models in the latent space of generative models (diffusion and autoregressive), then distill them either into the generative model itself or into the latent decoder for in-model watermarking.
Result: Latent watermarks achieve competitive robustness with similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Distilling latent watermarkers outperforms distilling pixel-space ones in both efficiency and robustness.
Conclusion: Latent space watermarking via DistSeal provides a unified, efficient solution for watermarking AI-generated images across different generative models, offering significant speed advantages while maintaining robustness and imperceptibility.
Abstract: Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.
[151] ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier
Main category: cs.CV
TL;DR: ActionMesh: A feed-forward generative model that produces production-ready animated 3D meshes using temporal 3D diffusion, enabling fast generation from various inputs like video, text, or 3D mesh with animation prompts.
Details
Motivation: Existing 3D animation generation methods have practical limitations including complex setups, slow runtime, and quality constraints, making them difficult to apply in real-world applications.
Method: Two-stage approach: 1) Adapt 3D diffusion models with temporal axis to generate synchronized latent sequences of time-varying shapes, 2) Temporal 3D autoencoder translates independent shapes into deformations of a reference shape to create animations.
Result: State-of-the-art performance on video-to-4D benchmarks (Consistent4D, Objaverse) in both geometric accuracy and temporal consistency, with unprecedented speed and production-ready quality.
Conclusion: ActionMesh enables rapid generation of rig-free, topology-consistent animated 3D meshes that are production-ready and support seamless applications like texturing and retargeting.
Abstract: Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes “in action” in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed “temporal 3D diffusion”. Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
[152] HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval
Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai, Tao Jin
Main category: cs.CV
TL;DR: HVD model improves text-video retrieval by mimicking human vision with coarse-to-fine alignment, using frame selection and patch compression to focus on key visual information.
Details
Motivation: Current CLIP-based text-video retrieval methods suffer from "blind" feature interaction where models struggle to distinguish key visual information from background noise due to sparse textual queries, leading to inefficient feature matching.
Method: Proposes Human Vision-Driven (HVD) model with coarse-to-fine alignment: 1) Frame Features Selection Module (FFSM) mimics human macro-perception by selecting key frames to eliminate temporal redundancy; 2) Patch Features Compression Module (PFCM) simulates micro-perception by aggregating patch features into salient visual entities using advanced attention mechanism for precise entity-level matching.
Result: Extensive experiments on five benchmarks demonstrate that HVD captures human-like visual focus and achieves state-of-the-art performance in text-video retrieval.
Conclusion: The HVD framework successfully addresses the “blind” feature interaction problem by mimicking human cognitive behavior, establishing effective coarse-to-fine alignment that improves text-video retrieval performance through better visual focus and entity-level matching.
Abstract: The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from “blind” feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
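The coarse stage of this pipeline (key-frame selection) can be illustrated with a minimal sketch. The abstract does not say how FFSM scores frames; the code below assumes frames are ranked by cosine similarity to the query embedding, and the names `cosine` and `select_key_frames` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_key_frames(frame_feats, text_feat, k):
    """Return the indices of the k frames most similar to the text query.

    Assumption: similarity to the query is the selection criterion.
    Indices are returned in temporal order so later stages see an ordered clip.
    """
    ranked = sorted(
        range(len(frame_feats)),
        key=lambda i: cosine(frame_feats[i], text_feat),
        reverse=True,
    )
    return sorted(ranked[:k])

# Toy 2-D embeddings: frames 2 and 3 point in the query direction.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [0.0, 1.0]
selected = select_key_frames(frames, query, k=2)
print(selected)
```

Only the selected frames would then be passed to a finer, patch-level stage.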
[153] 360Anything: Geometry-Free Lifting of Images and Videos to 360°
Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker, Saurabh Saxena
Main category: cs.CV
TL;DR: 360Anything is a geometry-free framework that uses diffusion transformers to generate 360° panoramas from perspective images/videos without requiring camera metadata, achieving SOTA results and addressing seam artifacts.
Details
Motivation: Existing methods for lifting perspective images to 360° panoramas rely on explicit geometric alignment requiring known camera metadata, which limits application to in-the-wild data where such calibration is typically absent or noisy.
Method: A geometry-free framework built upon pre-trained diffusion transformers that treats the perspective input and panorama target as token sequences, learning the perspective-to-equirectangular mapping in a purely data-driven way. Introduces Circular Latent Encoding to address seam artifacts at ERP boundaries caused by zero-padding in the VAE encoder.
Result: Achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. Shows competitive results in zero-shot camera FoV and orientation estimation benchmarks.
Conclusion: 360Anything eliminates the need for camera information while achieving superior performance, demonstrates deep geometric understanding, and has broader utility in computer vision tasks beyond panorama generation.
Abstract: Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything’s deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
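Why zero-padding produces a seam can be seen on a single image row. The sketch below is not the paper's Circular Latent Encoding; it only contrasts zero padding with wrap-around padding under a small averaging filter, on the assumption that the left and right edges of an ERP image are adjacent on the sphere. The helper names are hypothetical.

```python
def pad_row(row, pad, mode):
    """Pad a 1-D row on both sides with zeros or with wrapped-around values."""
    if pad == 0:
        return list(row)
    if mode == "zero":
        return [0.0] * pad + list(row) + [0.0] * pad
    if mode == "circular":
        # The right edge of an ERP row continues at its left edge.
        return list(row[-pad:]) + list(row) + list(row[:pad])
    raise ValueError(f"unknown mode: {mode}")

def conv1d_valid(row, kernel):
    """Plain sliding-window filter with no implicit padding."""
    k = len(kernel)
    return [
        sum(row[i + j] * kernel[j] for j in range(k))
        for i in range(len(row) - k + 1)
    ]

row = [1.0, 1.0, 1.0, 1.0]          # a constant row: there is no real edge here
kernel = [1 / 3, 1 / 3, 1 / 3]      # 3-tap average

zero_out = conv1d_valid(pad_row(row, 1, "zero"), kernel)
circ_out = conv1d_valid(pad_row(row, 1, "circular"), kernel)
print(zero_out)  # boundary values drop because zeros leak in
print(circ_out)  # stays constant across the wrap-around boundary
```

With zero padding the two boundary outputs differ from the interior even though the input is constant, which is the kind of discontinuity that shows up as a seam when the panorama is wrapped.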
[154] Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
Main category: cs.CV
TL;DR: RAEs outperform VAEs for large-scale text-to-image generation, showing better stability, faster convergence, and higher quality across model scales from 0.5B to 9.8B parameters.
Details
Motivation: To investigate whether Representation Autoencoders (RAEs), which showed advantages on ImageNet, can scale to large-scale freeform text-to-image generation and compare them against state-of-the-art VAEs like FLUX.
Method: Scaling RAE decoders on a frozen SigLIP-2 encoder using web, synthetic, and text-rendering data; stress-testing RAE design choices; conducting a controlled comparison of RAE vs FLUX VAE across diffusion transformer scales (0.5B to 9.8B parameters).
Result: RAEs consistently outperform VAEs during pretraining across all scales. During finetuning, VAEs catastrophically overfit after 64 epochs while RAEs remain stable through 256 epochs with better performance. RAEs show faster convergence and better generation quality.
Conclusion: RAEs are a simpler and stronger foundation than VAEs for large-scale T2I generation. The shared representation space enables unified models where visual understanding and generation can operate together, opening new possibilities for multimodal reasoning.
Abstract: Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
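The abstract names dimension-dependent noise scheduling without giving a formula. One plausible form, assumed here rather than taken from the paper, is the timestep shift used for resolution-dependent schedules in rectified-flow models, with the latent dimensionality standing in for resolution. The parameters `latent_dim` and `base_dim` are hypothetical, and the sketch assumes t = 1 is pure noise.

```python
import math

def shift_timestep(t, latent_dim, base_dim):
    """Shift a timestep t in [0, 1] toward the noisy end as latent_dim grows.

    Assumed form: t' = alpha * t / (1 + (alpha - 1) * t),
    with alpha = sqrt(latent_dim / base_dim). Endpoints 0 and 1 are fixed.
    """
    alpha = math.sqrt(latent_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

# Same nominal timestep, increasingly high-dimensional latents.
for dim in (1, 4, 16):
    print(dim, round(shift_timestep(0.5, latent_dim=dim, base_dim=1), 4))
```

The intuition is that higher-dimensional latents need more noise to reach the same level of corruption, so sampling spends more of its schedule at large t.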
[155] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou
Main category: cs.CV
TL;DR: PyraTok is a language-aligned pyramidal video tokenizer that learns multi-scale discrete latents with shared binary codebooks, achieving SOTA performance across video generation and understanding tasks.
Details
Motivation: Existing video VAEs use single-scale tokenizers with limited vocabularies and weak language supervision, resulting in poor cross-modal alignment and zero-shot transfer capabilities.
Method: Builds on a pretrained video VAE with a Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at multiple depths using a shared large binary codebook, jointly optimizing multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy.
Result: Achieves SOTA video reconstruction, improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding across ten benchmarks, scaling to 4K/8K resolutions.
Conclusion: PyraTok demonstrates that language-aligned pyramidal tokenization with multi-scale discrete latents significantly enhances video representation learning, enabling superior cross-modal alignment and zero-shot transfer capabilities.
Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
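A binary codebook shared across pyramid levels can be pictured with a small sketch. The abstract does not specify how LaPQ quantizes; the code assumes simple sign-based binarization where each feature's bit pattern doubles as its token id, so a d-dimensional feature indexes an implicit codebook of 2^d entries. All function names are hypothetical.

```python
def binary_quantize(feature):
    """Binarize a feature by sign and pack the bits into an integer token id.

    Assumption: one bit per channel, positive -> 1, otherwise 0.
    Returns the {-1, +1} code vector and the token id.
    """
    bits = [1 if x > 0 else 0 for x in feature]
    code = [1.0 if b else -1.0 for b in bits]
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | b
    return code, token_id

def quantize_pyramid(features_by_level):
    """Apply the same binary quantizer at every level of a feature pyramid."""
    return [
        [binary_quantize(f)[1] for f in level]
        for level in features_by_level
    ]

# Two levels: a coarse level with one feature, a finer level with two.
pyramid = [
    [[0.3, -1.2, 0.8]],
    [[-0.1, 0.2, 0.5], [1.0, 1.0, -1.0]],
]
tokens = quantize_pyramid(pyramid)
print(tokens)
```

Because the quantizer is the same at every depth, tokens from different levels live in one vocabulary, which is what lets a single autoregressive model run over the whole hierarchy.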
[156] Why Can’t I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition
Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi
Main category: cs.CV
TL;DR: RCORE framework addresses object-driven verb shortcuts in Zero-Shot Compositional Action Recognition by enforcing temporally grounded verb learning through composition-aware augmentation and temporal order regularization, significantly improving unseen composition accuracy.
Details
Motivation: Existing Zero-Shot Compositional Action Recognition models fail due to object-driven verb shortcuts, where models ignore visual evidence and overfit to co-occurrence statistics instead of learning true compositional understanding.
Method: RCORE framework with two key components: (1) composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (2) temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure.
Result: RCORE significantly improves unseen composition accuracy across Sth-com and EK100-com benchmarks, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps.
Conclusion: Object-driven shortcuts are a critical limiting factor in ZS-CAR, and addressing them through temporally grounded verb learning is essential for robust compositional video understanding.
Abstract: We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
[157] CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback
Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu, Xingye Tian, Xin Tao, Ying-Cong Chen
Main category: cs.CV
TL;DR: CamPilot improves camera controllability in video diffusion models by introducing a camera-aware 3D decoder that converts video latents into 3D Gaussians for efficient reward computation.
Details
Motivation: Current camera-controlled video diffusion models have limited camera controllability, and existing Reward Feedback Learning (ReFL) approaches face challenges: lack of video-camera alignment assessment, computational overhead from RGB decoding, and neglect of 3D geometric information.
Method: Proposes an efficient camera-aware 3D decoder that decodes video latents with camera poses into 3D Gaussians. Camera poses serve as both input and projection parameters, so misalignment causes geometric distortions. Optimizes pixel-level consistency between rendered novel views and ground-truth views as the reward, with a visibility term, derived via geometric warping, that restricts supervision to deterministic regions.
Result: Extensive experiments on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of the proposed method in improving camera controllability.
Conclusion: The camera-aware 3D decoder approach successfully addresses limitations of existing ReFL methods by enabling efficient 3D-aware reward computation for better video-camera alignment in diffusion models.
Abstract: Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, the camera controllability still remains limited. In this work, we build upon Reward Feedback Learning and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latent into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization. Specifically, video latent along with the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: https://a-bigbao.github.io/CamPilot/
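The reward described above, pixel-level consistency restricted to visible regions, can be sketched in a few lines. This is an illustration only: the abstract does not give the exact reward, so the code assumes a negative mean squared error over pixels flagged as visible, with images flattened to lists. The name `masked_reward` is hypothetical.

```python
def masked_reward(rendered, target, visible):
    """Negative mean squared error over pixels marked visible.

    Assumption: `visible` is a 0/1 mask obtained elsewhere (for example by
    geometric warping); pixels outside it are not supervised at all.
    Returns 0.0 when no pixel is visible.
    """
    total, count = 0.0, 0
    for r, t, v in zip(rendered, target, visible):
        if v:
            total += (r - t) ** 2
            count += 1
    return -total / count if count else 0.0

rendered = [0.5, 0.2, 0.9]
target = [0.5, 0.4, 0.1]
visible = [1, 1, 0]   # the last pixel is newly revealed content, so it is skipped

print(masked_reward(rendered, target, visible))
print(masked_reward(rendered, target, [1, 1, 1]))  # penalizes the unseen pixel too
```

Without the mask, the large error on the unobserved pixel dominates the reward even though the generator cannot be expected to match the ground truth there.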
[158] Multi-event Video-Text Retrieval
Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
Main category: cs.CV
TL;DR: Proposes Multi-event Video-Text Retrieval (MeVTR) task and Me-Retriever model to handle videos with multiple events, addressing limitations of traditional VTR models that assume bijective video-text correspondences.
Details
Motivation: Traditional Video-Text Retrieval models assume bijective video-text correspondences, but real-world videos often contain multiple events while texts (queries/metadata) are specific to single events, creating a gap between training objectives and practical applications.
Method: Introduces Me-Retriever with key event video representation and a new MeVTR loss specifically designed for the multi-event scenario.
Result: Me-Retriever outperforms other models in both Video-to-Text and Text-to-Video retrieval tasks, establishing a strong baseline for the MeVTR task.
Conclusion: The work addresses a practical limitation in VTR and provides a foundation for future research on multi-event video-text retrieval scenarios.
Abstract: Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.
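The gap between a single pooled video embedding and per-event matching can be shown on a toy example. The abstract does not describe how Me-Retriever scores a video, so the sketch assumes a text query is matched against each event embedding and the best match is kept; the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def max_event_score(event_feats, text_feat):
    """Score a video by its best-matching event (assumed scoring rule)."""
    return max(dot(e, text_feat) for e in event_feats)

def mean_pooled_score(event_feats, text_feat):
    """Baseline: average all events into one video vector, then match."""
    n = len(event_feats)
    pooled = [sum(col) / n for col in zip(*event_feats)]
    return dot(pooled, text_feat)

def rank_videos(videos, text_feat, score_fn):
    """Return video names sorted from best to worst match."""
    return sorted(videos, key=lambda name: score_fn(videos[name], text_feat), reverse=True)

videos = {
    "v1": [[1.0, 0.0], [0.0, 1.0]],  # two distinct events
    "v2": [[0.7, 0.7]],              # one event, partly related to the query
}
query = [0.0, 1.0]                   # matches the second event of v1 exactly

print(rank_videos(videos, query, max_event_score))
print(rank_videos(videos, query, mean_pooled_score))
```

Mean pooling dilutes the matching event in v1 with the unrelated one and ranks v2 first, while per-event matching retrieves v1.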
[159] Efficient Multimodal Large Language Models: A Survey
Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Main category: cs.CV
TL;DR: A comprehensive survey on efficient Multimodal Large Language Models (MLLMs) that addresses their computational challenges and explores lightweight solutions for edge computing applications.
Details
Motivation: Despite impressive performance in visual tasks, MLLMs face deployment barriers due to large model sizes and high computational costs, creating a need for efficient solutions suitable for edge computing scenarios.
Method: Systematic review approach summarizing the timeline of efficient MLLMs, analyzing efficient structures and strategies, and surveying applications across different domains.
Result: Provides a comprehensive overview of the current efficient MLLM research landscape, identifies existing limitations, and outlines promising future research directions in the field.
Conclusion: Efficient MLLMs have significant potential for practical deployment, particularly in edge computing, but require further research to overcome current limitations and advance the field.
Abstract: In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, research state of efficient structures and strategies, and the applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.
[160] CropCraft: Complete Structural Characterization of Crop Plants From Images
Albert J. Zhai, Xinlei Wang, Kaiyuan Li, Zhao Jiang, Junxiong Zhou, Sheng Wang, Zhenong Jin, Kaiyu Guan, Shenlong Wang
Main category: cs.CV
TL;DR: A method for building complete 3D digital twins of plants using inverse procedural modeling to overcome occlusion and complex geometry challenges in agricultural settings.
Details
Motivation: Current 3D reconstruction methods fail to recover complete plant shapes due to heavy occlusion and complex geometries, limiting applications in agriculture, environmental science, and robotics.
Method: Uses inverse procedural modeling by first estimating depth maps via neural radiance field fitting, then optimizing a specialized loss to estimate morphological parameters that produce consistent depth renderings.
Result: Produces complete and biologically plausible 3D plant models validated on real agricultural field images, enabling various monitoring and simulation applications.
Conclusion: The method successfully creates complete 3D digital twins of plants that overcome occlusion challenges and can be used for practical agricultural applications.
Abstract: The ability to automatically build 3D digital twins of plants from images has countless applications in agriculture, environmental science, robotics, and other fields. However, current 3D reconstruction methods fail to recover complete shapes of plants due to heavy occlusion and complex geometries. In this work, we present a novel method for 3D modeling of agricultural crops based on optimizing a parametric model of plant morphology via inverse procedural modeling. Our method first estimates depth maps by fitting a neural radiance field and then optimizes a specialized loss to estimate morphological parameters that result in consistent depth renderings. The resulting 3D model is complete and biologically plausible. We validate our method on a dataset of real images of agricultural fields, and demonstrate that the reconstructed canopies can be used for a variety of monitoring and simulation applications.
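Inverse procedural modeling, as described above, searches for the parameters of a plant generator whose rendered depth agrees with the observed depth. The toy sketch below assumes a one-parameter generator (canopy height) and uses a grid search in place of the paper's optimizer and specialized loss; every name and the triangular canopy profile are hypothetical.

```python
def render_depth(height, n_pixels=8, camera_height=2.0):
    """Toy procedural plant: a triangular canopy profile seen from above."""
    center = (n_pixels - 1) / 2
    depths = []
    for i in range(n_pixels):
        canopy = height * max(0.0, 1.0 - abs(i - center) / center)
        depths.append(camera_height - canopy)
    return depths

def depth_loss(predicted, observed):
    """Mean squared depth error over pixels that have an observation.

    None marks pixels with no usable depth estimate; they are skipped.
    """
    pairs = [(p, o) for p, o in zip(predicted, observed) if o is not None]
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

def fit_height(observed, candidates):
    """Pick the candidate height whose rendering best matches the observation."""
    return min(
        candidates,
        key=lambda h: depth_loss(render_depth(h, len(observed)), observed),
    )

# Synthetic observation from a plant of height 1.2, with two pixels missing.
observed = render_depth(1.2)
observed[0] = None
observed[5] = None

candidates = [i / 10 for i in range(0, 21)]   # 0.0, 0.1, ..., 2.0
best = fit_height(observed, candidates)
print(best)
```

Because the generator always produces a whole plant, the fitted model is complete even where the observation has holes, which is the appeal of fitting parameters rather than reconstructing surfaces directly.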
[161] Language-guided Medical Image Segmentation with Target-informed Multi-level Contrastive Alignments
Mingjian Li, Mingyuan Meng, Shuchang Ye, Michael Fulham, Lei Bi, Jinman Kim
Main category: cs.CV
TL;DR: TMCA is a target-informed multi-level contrastive alignment framework that bridges image-text pattern gaps for medical language-guided segmentation by enabling fine-grained textual guidance through target-sensitive semantic distance modeling and multi-level contrastive alignment.
Details
Motivation: Existing language-guided segmentation methods neglect inherent pattern gaps between medical images and clinical reports, resulting in suboptimal visual-language integration. Current contrastive alignment techniques only align high-level global semantics without involving low-level localized target information, failing to provide fine-grained textual guidance on crucial medical image details.
Method: Proposes TMCA framework with three key components: (1) target-sensitive semantic distance module for granular image-text alignment using target information, (2) multi-level contrastive alignment strategy for fine-grained textual guidance to multi-scale image details, and (3) language-guided target enhancement module that reinforces attention to critical image regions based on aligned image-text patterns.
Result: Extensive experiments on four public benchmark datasets demonstrate that TMCA achieves superior performance over state-of-the-art language-guided medical image segmentation methods.
Conclusion: TMCA effectively bridges image-text pattern gaps in medical language-guided segmentation by enabling target-informed alignments and fine-grained textual guidance, outperforming existing methods through better integration of clinical report semantics with medical image details.
Abstract: Medical image segmentation is a fundamental task in numerous medical engineering applications. Recently, language-guided segmentation has shown promise in medical scenarios where textual clinical reports are readily available as semantic guidance. Clinical reports contain diagnostic information provided by clinicians, which can provide auxiliary textual semantics to guide segmentation. However, existing language-guided segmentation methods neglect the inherent pattern gaps between image and text modalities, resulting in sub-optimal visual-language integration. Contrastive learning is a well-recognized approach to align image-text patterns, but it has not been optimized for bridging the pattern gaps in medical language-guided segmentation that relies primarily on medical image details to characterize the underlying disease/targets. Current contrastive alignment techniques typically align high-level global semantics without involving low-level localized target information, and thus cannot deliver fine-grained textual guidance on crucial image details. In this study, we propose a Target-informed Multi-level Contrastive Alignment framework (TMCA) to bridge image-text pattern gaps for medical language-guided segmentation. TMCA enables target-informed image-text alignments and fine-grained textual guidance by introducing: (i) a target-sensitive semantic distance module that utilizes target information for more granular image-text alignment modeling, (ii) a multi-level contrastive alignment strategy that directs fine-grained textual guidance to multi-scale image details, and (iii) a language-guided target enhancement module that reinforces attention to critical image regions based on the aligned image-text patterns. Extensive experiments on four public benchmark datasets demonstrate that TMCA enabled superior performance over state-of-the-art language-guided medical image segmentation methods.
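Contrastive alignment applied at several feature levels can be sketched with a generic InfoNCE loss. This does not reproduce TMCA's target-sensitive semantic distance; it assumes a plain image-to-text InfoNCE per level, combined with hypothetical per-level weights, and takes precomputed similarity matrices as input.

```python
import math

def info_nce(sim, temperature=0.1):
    """Image-to-text InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs sit on
    the diagonal. Uses a log-sum-exp shift for numerical stability.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / n

def multi_level_loss(sims_by_level, weights):
    """Weighted sum of per-level InfoNCE losses (weights are an assumption)."""
    return sum(w * info_nce(s) for w, s in zip(weights, sims_by_level))

aligned = [[1.0, 0.0], [0.0, 1.0]]      # matched pairs are the most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # each image prefers the wrong text

print(info_nce(aligned))
print(info_nce(misaligned))
print(multi_level_loss([aligned, misaligned], weights=[0.5, 0.5]))
```

Summing the loss over coarse and fine feature levels is one way to push textual guidance down to localized image detail instead of aligning only global semantics.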
[162] Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Main category: cs.CV
TL;DR: A new generative framework bridges representation learning and diffusion models by jointly modeling low-level image latents and high-level semantic features, improving image quality and training efficiency.
Details
Motivation: Current latent diffusion models dominate image generation but struggle to integrate representation learning with generative modeling, creating a gap between low-level latents and high-level semantics.
Method: Proposes latent-semantic diffusion that uses a diffusion model to jointly model VAE image latents and pretrained self-supervised encoder features (like DINO), requiring minimal modifications to standard Diffusion Transformer architectures.
Result: The method significantly enhances generative quality and training efficiency, eliminates complex distillation objectives, and enables Representation Guidance for steering image generation with learned semantics.
Conclusion: Establishes a new direction for representation-aware generative modeling with substantial improvements in image quality and convergence speed in both conditional and unconditional settings.
Abstract: Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling. Project page and code: https://representationdiffusion.github.io
[163] Decoupling Multi-Contrast Super-Resolution: Self-Supervised Implicit Re-Representation for Unpaired Cross-Modal Synthesis
Yinzhe Wu, Hongyu Rui, Fanwen Wang, Jiahao Huang, Zhenxuan Zhang, Haosen Zhang, Zi Wang, Guang Yang
Main category: cs.CV
TL;DR: A novel two-stage framework for multi-contrast MRI super-resolution that combines population-level anatomical priors with patient-specific optimization, enabling arbitrary-scale upsampling without paired training data.
Details
Motivation: Current deep learning methods for multi-contrast MRI super-resolution have two major limitations: they require large paired LR/HR datasets (which are scarce) and are limited to fixed upsampling scales. Recent self-supervised methods remove the paired data requirement but fail to leverage valuable population-level anatomical priors.
Method: The proposed framework decouples MCSR into two stages: (1) an unpaired cross-modal synthesis (uCMS) module trained once on unpaired population data to learn robust anatomical priors, and (2) a lightweight, patient-specific implicit re-representation (IrR) module optimized in a self-supervised manner to fuse population priors with the subject’s own LR target data. The IrR module uses implicit neural representations, making the framework inherently scale-agnostic.
Result: The method demonstrates superior quantitative performance on different datasets, with exceptional robustness at extreme scales (16x, 32x) where competing methods fail. It achieves high-fidelity, arbitrary-scale super-resolution without requiring any paired LR/HR or paired cross-modal training data.
Conclusion: This work presents a data-efficient, flexible, and computationally lightweight paradigm for multi-contrast MRI super-resolution that uniquely fuses population-level knowledge with patient-specific fidelity, enabling high-quality reconstruction at arbitrary scales.
Abstract: Multi-contrast super-resolution (MCSR) is crucial for enhancing MRI but current deep learning methods are limited. They typically require large, paired low- and high-resolution (LR/HR) training datasets, which are scarce, and are trained for fixed upsampling scales. While recent self-supervised methods remove the paired data requirement, they fail to leverage valuable population-level priors. In this work, we propose a novel, decoupled MCSR framework that resolves both limitations. We reformulate MCSR into two stages: (1) an unpaired cross-modal synthesis (uCMS) module, trained once on unpaired population data to learn a robust anatomical prior; and (2) a lightweight, patient-specific implicit re-representation (IrR) module. This IrR module is optimized in a self-supervised manner to fuse the population prior with the subject’s own LR target data. This design uniquely fuses population-level knowledge with patient-specific fidelity without requiring any paired LR/HR or paired cross-modal training data. By building the IrR module on an implicit neural representation, our framework is also inherently scale-agnostic. Our method demonstrates superior quantitative performance on different datasets, with exceptional robustness at extreme scales (16x, 32x), a regime where competing methods fail. Our work presents a data-efficient, flexible, and computationally lightweight paradigm for MCSR, enabling high-fidelity, arbitrary-scale
[164] Multi-View Projection for Unsupervised Domain Adaptation in 3D Semantic Segmentation
Andrew Caunes, Thierry Chateau, Vincent Fremont
Main category: cs.CV
TL;DR: A novel unsupervised domain adaptation approach for 3D semantic segmentation using multi-view projection and pseudo-label generation to address domain shift across different datasets.
Details
Motivation: State-of-the-art 3D semantic segmentation models suffer from severe domain shift when deployed across different datasets in autonomous driving applications, requiring adaptation without target domain annotations.
Method: Aligns Lidar scans into coherent 3D scenes, renders them from multiple virtual camera poses to create synthetic 2D datasets, trains ensemble of 2D segmentation models, uses depth-aware voting to back-project logits to 3D pseudo-labels, then fine-tunes 3D model on target domain.
Result: Achieves state-of-the-art results in Real-to-Real Unsupervised Domain Adaptation on nuScenes and SemanticKITTI datasets, and successfully applies Simulation-to-Real adaptation with SynLidar dataset.
Conclusion: The proposed multi-view projection framework effectively addresses domain shift in 3D semantic segmentation and enables segmentation of rare classes using only 2D annotations by leveraging 3D annotations from source domain.
Abstract: 3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. In this paper, we propose an Unsupervised Domain Adaptation approach where a 3D segmentation model is trained on the target dataset using pseudo-labels generated by a novel multi-view projection framework. Our approach first aligns Lidar scans into coherent 3D scenes and renders them from multiple virtual camera poses to create large-scale synthetic 2D semantic segmentation datasets in various modalities. The generated datasets are used to train an ensemble of 2D segmentation models in the point cloud view domain on each modality. During inference, the models process a large number of views per scene; the resulting logits are back-projected to 3D with a depth-aware voting scheme to generate final point-wise labels. These labels are then used to fine-tune a 3D segmentation model in the target domain. We evaluate our approach Real-to-Real on the nuScenes and SemanticKITTI datasets. We also evaluate it Simulation-to-Real with the SynLidar dataset. Our contributions are a novel method that achieves state-of-the-art results in Real-to-Real Unsupervised Domain Adaptation, and an application of our method to segment rare classes, for which target 3D annotations are not available, by only using 2D annotations for those classes and leveraging 3D annotations for other classes in a source domain.
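The depth-aware voting step in [164] can be illustrated with a minimal Python sketch. The abstract does not spell out the voting rule, so everything here is an assumption made for illustration: the `depth_aware_vote` function, the `depth_tol` threshold, and the choice to sum logits only from views in which a point's depth agrees with the depth rendered at its pixel (that is, views where the point is not occluded).

```python
def depth_aware_vote(observations, num_points, num_classes, depth_tol=0.2):
    """Fuse per-view 2D logits into one label per 3D point (illustrative only).

    observations: iterable of (point_id, point_depth, rendered_depth, logits),
    one entry per (view, point) pair whose projection falls inside the image.
    A view contributes to a point only when the two depths agree within
    depth_tol, which is a hypothetical visibility test.
    """
    votes = [[0.0] * num_classes for _ in range(num_points)]
    counts = [0] * num_points
    for point_id, point_depth, rendered_depth, logits in observations:
        if abs(point_depth - rendered_depth) > depth_tol:
            continue  # point is hidden behind another surface in this view
        for c in range(num_classes):
            votes[point_id][c] += logits[c]
        counts[point_id] += 1
    # Points never seen by any view get the placeholder label -1.
    return [max(range(num_classes), key=v.__getitem__) if n else -1
            for v, n in zip(votes, counts)]


observations = [
    (0, 5.0, 5.05, [2.0, 0.0]),  # visible: depths agree
    (0, 5.0, 3.00, [0.0, 9.0]),  # occluded: vote is discarded
    (1, 2.0, 2.00, [0.1, 0.7]),
]
labels = depth_aware_vote(observations, num_points=3, num_classes=2)
```

In this toy input the confident but occluded vote for point 0 is ignored, and point 2, which no view observes, stays unlabeled.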
[165] CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head Synthesis
Florian Barthel, Wieland Morgenstern, Paul Hinzer, Anna Hilsmann, Peter Eisert
Main category: cs.CV
TL;DR: CGS-GAN is a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without view-conditioning, achieving resolutions up to 2048² with competitive FID scores.
Details
Motivation: Existing 3D GANs for human heads compromise 3D consistency by using view-conditioning (which causes identity changes with camera shifts) or produce poor novel views when fixing the camera. Removing view-conditioning typically destabilizes GAN training.
Method: Introduces multi-view regularization for training stability, adapts conditional loss from existing 3D Gaussian splatting GANs, and designs a generator architecture for efficient rendering and scaling. Also curates a new high-resolution FFHQ-derived dataset focusing on larger head portions with reduced artifacts.
Result: Achieves very high rendering quality with competitive FID scores while ensuring consistent 3D scene generation. The framework supports output resolutions up to 2048² and enables stable training without view-conditioning.
Conclusion: CGS-GAN successfully addresses the 3D consistency vs. training stability trade-off in 3D GANs for human heads, enabling high-quality, view-consistent synthesis through novel regularization techniques and architectural improvements.
Abstract: Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check out our project page here: https://fraunhoferhhi.github.io/cgs-gan/
[166] BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change
Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger
Main category: cs.CV
TL;DR: Researchers created the first multimodal dataset (BAH) for recognizing ambivalence/hesitancy in videos to enable automatic detection in digital health interventions.
Details
Motivation: Ambivalence/hesitancy is a key barrier to health behavior change, but current digital interventions lack effective, cost-efficient ways to recognize it automatically. No existing datasets support machine learning model development for this purpose.
Method: Collected 1,427 videos (10.6 hours) from 300 Canadian participants answering questions designed to elicit ambivalence/hesitancy. Videos were annotated by three experts with timestamps, frame/video-level annotations, transcripts, cropped faces, and metadata.
Result: Created the BAH dataset with binary annotations (presence/absence of A/H) and provided baseline benchmarking results for frame/video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation.
Conclusion: The BAH dataset enables development of machine learning models for automatic ambivalence/hesitancy recognition, which is crucial for personalizing and improving cost-effectiveness of digital behavior change interventions.
Abstract: Ambivalence and hesitancy (A/H), two closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. A/H is a subtle and conflicting emotion that sets a person in a state between positive and negative orientations, or between acceptance and refusal to do something. It manifests as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H as done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no dataset currently exists for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours captured from 300 participants across Canada answering predefined questions to elicit A/H. It is intended to mirror real-world online personalized behaviour change interventions. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participants’ metadata are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation. The data, code, and pretrained weights are available.
[167] OccLE: Label-Efficient 3D Semantic Occupancy Prediction
Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Hongsheng Li, Guosheng Lin
Main category: cs.CV
TL;DR: OccLE is a label-efficient 3D semantic occupancy prediction method that achieves competitive performance with only 10% voxel annotations by decoupling semantic and geometric learning tasks and fusing them via Dual Mamba.
Details
Motivation: Existing 3D semantic occupancy prediction methods either require costly full supervision with voxel-level annotations or use self-supervision with limited guidance and suboptimal performance. There's a need for label-efficient approaches that maintain high performance with limited annotations.
Method: Decouples semantic and geometric learning tasks. Semantic branch distills 2D foundation models to provide aligned pseudo labels for 2D/3D semantic learning. Geometric branch integrates image and LiDAR inputs using cross-plane synergy with semi-supervision. Features are fused through Dual Mamba architecture with scatter-accumulated projection to supervise unannotated predictions with aligned pseudo labels.
Result: Achieves competitive performance with only 10% of voxel annotations on SemanticKITTI and Occ3D-nuScenes datasets, demonstrating label efficiency while maintaining high performance.
Conclusion: OccLE provides an effective label-efficient approach for 3D semantic occupancy prediction by decoupling semantic and geometric learning and leveraging pseudo labels from 2D foundation models, enabling high performance with significantly reduced annotation requirements.
Abstract: 3D semantic occupancy prediction offers intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a label-efficient 3D semantic occupancy prediction method that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Therefore, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherency, employing semi-supervision to enhance geometry learning. We fuse semantic-geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated prediction with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets. The code will be publicly released at https://github.com/NerdFNY/OccLE
[168] Skin Lesion Phenotyping via Nested Multi-modal Contrastive Learning
Dionysis Christopoulos, Sotiris Spanos, Eirini Baltzi, Valsamis Ntouskos, Konstantinos Karantzalos
Main category: cs.CV
TL;DR: SLIMP is a novel pre-training method for skin lesion analysis that uses nested contrastive learning to combine image data with patient metadata, improving classification performance over image-only approaches.
Details
Motivation: Current melanoma detection and skin lesion classification face challenges due to variations in imaging conditions and lack of clinical context. Clinicians use a holistic approach considering patient history and multiple lesions, which image-only AI models miss.
Method: SLIMP uses nested contrastive learning to capture complex relationships between skin lesion images and metadata. It combines individual lesion appearance with patient-level metadata from medical records and other clinically relevant information.
Result: The proposed pre-training strategy outperforms other pre-training methods on downstream skin lesion classification tasks, demonstrating the quality of learned representations through better performance.
Conclusion: By fully exploiting all available data modalities (images + metadata) through nested contrastive learning, SLIMP creates richer representations that improve skin lesion classification, mimicking clinicians’ holistic approach.
Abstract: We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and lack of clinical and phenotypical context. Clinicians typically follow a holistic approach for assessing the risk level of the patient and for deciding which lesions may be malignant and need to be excised, by considering the patient’s medical history as well as the appearance of other lesions of the patient. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance compared to other pre-training strategies on downstream skin lesion classification tasks, highlighting the quality of the learned representations.
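For readers unfamiliar with contrastive pre-training, the sketch below shows a generic InfoNCE-style loss over an image-to-metadata similarity matrix, written in plain Python. It is not the SLIMP objective: the abstract for [168] does not define how the nesting works, so the two-level sum at the end (lesion level plus patient level), the `info_nce` helper, the temperature value, and the toy similarity matrices are all assumptions chosen only to illustrate the general idea.

```python
import math


def info_nce(sim, temperature=0.1):
    """Generic InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between image embedding i and metadata
    embedding j; matching pairs are assumed to sit on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]  # cross-entropy with the diagonal as target
    return total / len(sim)


# Hypothetical two-level use: one term per lesion, one term per patient.
lesion_sim = [[0.9, 0.1], [0.2, 0.8]]
patient_sim = [[0.7, 0.3], [0.4, 0.6]]
nested_loss = info_nce(lesion_sim) + info_nce(patient_sim)
```

The loss is small when each image is most similar to its own metadata and grows when the pairing is confused, which is the property any contrastive objective of this family relies on.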
[169] Rasterizing Wireless Radiance Field via Deformable 2D Gaussian Splatting
Mufan Liu, Cixiao Zhang, Qi Yang, Yujie Cao, Yiling Xu, Yin Xu, Shu Sun, Mingzeng Dai, Yunfeng Guan
Main category: cs.CV
TL;DR: SwiftWRF introduces deformable 2D Gaussian splatting for wireless radiance field modeling, achieving 500x faster reconstruction than state-of-the-art methods while maintaining high signal quality.
Details
Motivation: Traditional wireless radiance field modeling approaches suffer from limited accuracy or require strong scene priors, while recent NeRF-based methods are computationally expensive and not suitable for real-time deployment.
Method: Proposes SwiftWRF, a deformable 2D Gaussian splatting framework that uses CUDA-accelerated rasterization for rendering spectra at over 100,000 fps and a lightweight MLP to model deformation of 2D Gaussians to capture mobility-induced variations.
Result: SwiftWRF reconstructs WRF spectra up to 500x faster than existing state-of-the-art methods while significantly enhancing signal quality, and demonstrates effectiveness in AoA and RSSI prediction applications.
Conclusion: SwiftWRF successfully brings Gaussian splatting efficiency to the wireless domain, enabling compact and accurate WRF reconstruction suitable for real-time deployment in communication systems.
Abstract: Modeling the wireless radiance field (WRF) is fundamental to modern communication systems, enabling key tasks such as localization, sensing, and channel estimation. Traditional approaches, which rely on empirical formulas or physical simulations, often suffer from limited accuracy or require strong scene priors. Recent neural radiance field (NeRF)-based methods improve reconstruction fidelity through differentiable volumetric rendering, but their reliance on computationally expensive multilayer perceptron (MLP) queries hinders real-time deployment. To overcome these challenges, we introduce Gaussian splatting (GS) to the wireless domain, leveraging its efficiency in modeling optical radiance fields to enable compact and accurate WRF reconstruction. Specifically, we propose SwiftWRF, a deformable 2D Gaussian splatting framework that synthesizes WRF spectra at arbitrary positions under single-sided transceiver mobility. SwiftWRF employs CUDA-accelerated rasterization to render spectra at over 100,000 fps and uses a lightweight MLP to model the deformation of 2D Gaussians, effectively capturing mobility-induced WRF variations. In addition to novel spectrum synthesis, the efficacy of SwiftWRF is further underscored in its applications in angle-of-arrival (AoA) and received signal strength indicator (RSSI) prediction. Experiments conducted on both real-world and synthetic indoor scenes demonstrate that SwiftWRF can reconstruct WRF spectra up to 500x faster than existing state-of-the-art methods, while significantly enhancing its signal quality. The project page is https://evan-sudo.github.io/swiftwrf/.
[170] Dynamic Exploration on Segment-Proposal Graphs for Tubular Centerline Tracking
Chong Di, Jinglin Zhang, Zhenjiang Li, Jean-Marie Mirebeau, Da Chen, Laurent D. Cohen
Main category: cs.CV
TL;DR: Dynamic graph construction for tubular centerline tracking using Q-learning to build segment-proposal graphs on-demand during optimal path search.
Details
Motivation: Existing segment-wise methods for tubular centerline tracking use static graph construction that requires pre-computing all edges and weights, which can lead to search failure if the true path is absent from the candidate space. This static approach limits robustness in complex scenarios.
Method: Proposes a dynamic exploration scheme where segment-proposal graphs are built on-demand during optimal path search. Formulates the problem as a Markov decision process and applies Q-learning to compute edge weights only for visited transitions, adaptively expanding the action space when connectivity is insufficient.
Result: Experimental results on retinal vessels, roads, and rivers demonstrate consistent improvements over state-of-the-art methods in both accuracy and efficiency.
Conclusion: Dynamic graph construction with Q-learning addresses limitations of static methods, providing more robust and efficient tubular centerline tracking by building graphs adaptively during search rather than requiring complete pre-computation.
Abstract: Optimal curve methods provide a fundamental framework for tubular centerline tracking. Point-wise approaches, such as minimal paths, are theoretically elegant but often suffer from shortcut and short-branch combination problems in complex scenarios. Nonlocal segment-wise methods address these issues by mapping pre-extracted centerline fragments onto a segment-proposal graph, performing optimization in this abstract space, and recovering the target tubular centerline from the resulting optimal path. In this paradigm, graph construction is critical, as it directly determines the quality of the final result. However, existing segment-wise methods construct graphs in a static manner, requiring all edges and their weights to be pre-computed, i.e. the graph must be sufficiently complete prior to search. Otherwise, the true path may be absent from the candidate space, leading to search failure. To address this limitation, we propose a dynamic exploration scheme for constructing segment-proposal graphs, where the graph is built on demand during the search for optimal paths. By formulating the problem as a Markov decision process, we apply Q-learning to compute edge weights only for visited transitions and adaptively expand the action space when connectivity is insufficient. Experimental results on retinal vessels, roads, and rivers demonstrate consistent improvements over state-of-the-art methods in both accuracy and efficiency.
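The idea in [170] of computing edge weights only for visited transitions can be sketched with tabular Q-learning over a lazily expanded graph. This is a simplified illustration, not the authors' algorithm: the `propose_edges` and `edge_cost` callbacks, the reward (negative edge cost), the hyperparameters, and the toy four-segment graph are all assumptions, and the paper's rule for expanding the action space when connectivity is insufficient is reduced here to querying a segment's neighbours the first time it is reached.

```python
import random


def q_learning_on_demand(start, goal, propose_edges, edge_cost, episodes=200,
                         alpha=0.5, gamma=0.9, eps=0.2, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = {}           # (segment, next_segment) -> value, visited transitions only
    costs = {}       # edge weights, computed the first time an edge is taken
    neighbours = {}  # segment -> candidate successors, expanded on first visit

    def actions(state):
        if state not in neighbours:
            neighbours[state] = list(propose_edges(state))
        return neighbours[state]

    for _ in range(episodes):
        state = start
        for _ in range(max_steps):
            if state == goal or not actions(state):
                break
            acts = actions(state)
            if rng.random() < eps:  # explore
                nxt = rng.choice(acts)
            else:                   # exploit current estimates
                nxt = max(acts, key=lambda n: q.get((state, n), 0.0))
            if (state, nxt) not in costs:
                costs[(state, nxt)] = edge_cost(state, nxt)
            reward = -costs[(state, nxt)]
            future = 0.0 if nxt == goal else max(
                (q.get((nxt, n), 0.0) for n in actions(nxt)), default=0.0)
            old = q.get((state, nxt), 0.0)
            # Standard Q-learning update toward reward + discounted best future.
            q[(state, nxt)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q, costs


def greedy_path(q, start, goal, max_steps=50):
    """Follow the highest-valued visited transition from each segment."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        options = {n: v for (s, n), v in q.items() if s == path[-1]}
        if not options:
            break
        path.append(max(options, key=options.get))
    return path


# Toy segment graph with hypothetical edge costs.
graph = {"A": {"B": 1.0, "C": 5.0}, "B": {"D": 1.0}, "C": {"D": 1.0}, "D": {}}
q, costs = q_learning_on_demand("A", "D", lambda s: graph[s],
                                lambda s, n: graph[s][n])
path = greedy_path(q, "A", "D")
```

On this toy graph the learned values favour the cheaper route through segment B, and `costs` only ever holds weights for edges the agent actually traversed.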
[171] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans
Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli
Main category: cs.CV
TL;DR: Introduces MultiHuman-Testbench, a benchmark for evaluating multi-human image generation with 1,800 samples, 5,550 unique faces, and comprehensive evaluation metrics.
Details
Motivation: There's a lack of dedicated benchmarks for evaluating generative models that create images with multiple humans performing complex actions while preserving facial identities.
Method: Created a benchmark with 1,800 samples including curated text prompts and 5,550 unique human face images with diversity across age, ethnicity, and gender. Provided pose conditioning images and proposed a multi-faceted evaluation suite with four key metrics for face count, ID similarity, prompt alignment, and action detection.
Result: Conducted thorough evaluation of diverse models (zero-shot and training-based methods) and proposed novel techniques using human segmentation and Hungarian matching that significantly improve ID similarity.
Conclusion: The benchmark provides valuable insights and a standardized tool for advancing research in multi-human image generation, with dataset and evaluation codes made publicly available.
Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1,800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation. The dataset and evaluation codes will be available at https://github.com/Qualcomm-AI-research/MultiHuman-Testbench.
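To see why an optimal assignment matters when scoring ID similarity for several people in one image, as in [171], consider the sketch below. It finds the one-to-one pairing of reference faces to generated faces that maximizes total similarity. For brevity it enumerates all pairings by brute force, which yields the same optimum as Hungarian matching on small inputs; at scale one would use a proper solver such as `scipy.optimize.linear_sum_assignment`. The `match_faces` function and the similarity values are hypothetical and not taken from the benchmark's code.

```python
from itertools import permutations


def match_faces(similarity):
    """Pair each reference face with a distinct generated face.

    similarity[i][j] is the ID similarity between reference face i and
    generated face j. Assumes at least as many generated faces as references.
    Returns the (reference, generated) pairs and the mean matched similarity.
    """
    n_ref, n_gen = len(similarity), len(similarity[0])
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n_gen), n_ref):
        score = sum(similarity[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(enumerate(best_perm)), best_score / n_ref


# Two reference identities, three detected faces in the generated image.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
]
pairs, mean_similarity = match_faces(similarity)
```

A greedy row-by-row choice would give reference 0 its best face (index 0) and leave reference 1 with a poor match; the optimal assignment instead pairs reference 0 with face 1 and reference 1 with face 0, giving a higher total.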
[172] A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection
Yangjie Xiao, Ke Zhang, Jiacun Wang, Xin Sheng, Yurong Guo, Meijuan Chen, Zehua Ren, Zhaoye Zheng, Zhenbing Zhao
Main category: cs.CV
TL;DR: A segmentation-driven bolt defect editing method (SBDE) that generates synthetic defect images to augment imbalanced datasets for transmission line bolt defect detection.
Details
Motivation: Bolt defect detection is critical for transmission line safety, but suffers from scarcity of defect images and imbalanced data distributions that limit detection performance.
Method: Three-stage approach: 1) Bolt-SAM segmentation model with CLAHE-FFT Adapter and Multipart-Aware Mask Decoder for high-quality mask generation; 2) MOD-LaMa editing model that converts normal bolts to defective ones; 3) Editing Recovery Augmentation strategy to place edited defects back into original scenes.
Result: SBDE-generated defect images outperform state-of-the-art image editing models and significantly improve bolt defect detection performance across multiple constructed datasets.
Conclusion: The proposed SBDE method effectively addresses data scarcity and imbalance issues in bolt defect detection, demonstrating strong effectiveness and practical application potential for industrial inspection tasks.
Abstract: Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.
[173] Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model
Hongyang Wei, Baixin Xu, Hongbo Liu, Size Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, Chuanxin Tang, Zidong Wang, Yichen Wei, Liang Hu, Boyi Jiang, Wei Li, Ying He, Yang Liu, Xuchen Song, Yangguang Li, Yahui Zhou
Main category: cs.CV
TL;DR: UniPic2-SD3.5M-Kontext is a 2B-parameter DiT model that achieves SOTA image generation/editing through architectural modifications, large-scale pre-training, and a novel Progressive Dual-Task Reinforcement strategy, outperforming larger models like BAGEL (7B) and Flux-Kontext (12B).
Details
Motivation: Many open-source multimodal models prioritize scaling parameters over optimizing training strategies, limiting efficiency and performance. The authors aim to develop a more efficient model with better training strategies.
Method: 1) Architectural modifications to SD3.5-Medium, 2) Large-scale pre-training on high-quality data for joint text-to-image generation/editing, 3) Progressive Dual-Task Reinforcement (PDTR) strategy to enhance instruction following and editing consistency, 4) Extension to unified multimodal framework (UniPic2-Metaquery) by connecting with Qwen2.5-VL-7B.
Result: UniPic2-SD3.5M-Kontext outperforms larger models (BAGEL 7B, Flux-Kontext 12B) in image generation/editing. UniPic2-Metaquery achieves top-tier performance across diverse multimodal tasks with simple, scalable training paradigm.
Conclusion: The proposed training paradigm (Skywork UniPic 2.0) is effective and generalizable, demonstrating that optimized training strategies can achieve better performance than simply scaling model parameters.
Abstract: Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy (PDTR), which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement strategies, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly larger generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following MetaQuery, we connect UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to launch a unified multimodal model, UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. This consistently validates the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.
[174] No Mesh, No Problem: Estimating Coral Volume and Surface from Sparse Multi-View Images
Diego Eustachio Farchione, Ramzi Idoughi, Peter Wonka
Main category: cs.CV
TL;DR: A lightweight learning framework predicts 3D volume and surface area of coral-like objects from 2D multi-view RGB images using pre-trained feature extraction, point cloud fusion, and dual DGCNN decoders with uncertainty estimation.
Details
Motivation: Coral reef monitoring requires accurate volumetric and surface area measurements for growth analysis, but the complex morphology of corals makes this challenging. Current methods need efficient, scalable solutions that work directly from sparse image sets.
Method: Uses pre-trained VGGT to extract dense point maps from multi-view RGB images, merges them into unified point clouds with confidence scores, then processes through two parallel DGCNN decoder heads to jointly predict volume and surface area with confidence estimates. Employs composite loss function based on Gaussian negative log-likelihood in real and log domains for stability and uncertainty estimation.
Result: Achieves competitive accuracy and generalizes well to unseen coral morphologies. The framework enables efficient coral geometry estimation directly from sparse image sets.
Conclusion: Proposes a scalable learning framework for coral geometry estimation from 2D images that paves the way for efficient reef monitoring and coral growth analysis with practical applications in marine conservation.
Abstract: Effective reef monitoring requires the quantification of coral growth via accurate volumetric and surface area estimates, which is a challenging task due to the complex morphology of corals. We propose a novel, lightweight, and scalable learning framework that addresses this challenge by predicting the 3D volume and surface area of coral-like objects from 2D multi-view RGB images. Our approach utilizes a pre-trained module (VGGT) to extract dense point maps from each view; these maps are merged into a unified point cloud and enriched with per-view confidence scores. The resulting cloud is fed to two parallel DGCNN decoder heads, which jointly output the volume and the surface area of the coral, as well as their corresponding confidence estimates. To enhance prediction stability and provide uncertainty estimates, we introduce a composite loss function based on Gaussian negative log-likelihood in both real and log domains. Our method achieves competitive accuracy and generalizes well to unseen morphologies. This framework paves the way for efficient and scalable coral geometry estimation directly from a sparse set of images, with potential applications in coral growth analysis and reef monitoring.
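A loss built from Gaussian negative log-likelihood in both real and log domains, as described in [174], might look like the sketch below. The abstract gives no formula, so the specifics are assumptions: separate predicted means and variances for the raw value and for its logarithm, a simple weighted sum of the two terms, and the `log_weight` parameter. The per-sample Gaussian NLL itself is the standard expression 0.5 * (log(2πσ²) + (y − μ)² / σ²).

```python
import math


def gaussian_nll(target, mean, var):
    """Negative log-likelihood of target under a Gaussian N(mean, var)."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (target - mean) ** 2 / var)


def composite_nll(target, mean, var, log_mean, log_var, log_weight=1.0):
    """Hypothetical composite loss: real-domain NLL plus log-domain NLL.

    target must be positive (a volume or a surface area). The log-domain
    term compares log(target) with a separately predicted log_mean and
    log_var, which damps the influence of very large specimens.
    """
    real_term = gaussian_nll(target, mean, var)
    log_term = gaussian_nll(math.log(target), log_mean, log_var)
    return real_term + log_weight * log_term


# Toy example: true volume 120, with a good and a poor prediction.
good = composite_nll(120.0, mean=118.0, var=25.0,
                     log_mean=math.log(118.0), log_var=0.01)
poor = composite_nll(120.0, mean=60.0, var=25.0,
                     log_mean=math.log(60.0), log_var=0.01)
```

Because the predicted variance appears both in the log term and as the divisor of the squared error, the model is penalized for being overconfident when wrong, which is what makes the variance usable as an uncertainty estimate.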
[175] DF-LLaVA: Unlocking MLLM’s potential for Synthetic Image Detection via Prompt-Guided Knowledge Injection
Zhuokang Shen, Kaisen Zhang, Bohan Jia, Heming Jia, Yuan Fang, Zhou Yu, Shaohui Lin
Main category: cs.CV
TL;DR: DF-LLaVA is a framework that enhances MLLMs’ synthetic image detection accuracy to surpass expert models while maintaining interpretability by extracting and injecting latent knowledge via prompts.
Details
Motivation: Existing synthetic image detection models provide only binary judgments with limited explanatory insights, while MLLM-based methods offer interpretability but lag behind expert models in classification accuracy.
Method: DF-LLaVA extracts latent knowledge from MLLMs and injects it into training via prompts, unlocking the intrinsic discrimination potential of MLLMs while maintaining their interpretability.
Result: The framework achieves outstanding detection accuracy exceeding expert models while maintaining MLLM interpretability, as confirmed by extensive experiments.
Conclusion: DF-LLaVA successfully addresses the trade-off between accuracy and interpretability in synthetic image detection, providing both high accuracy and explainability.
Abstract: With the increasing prevalence of synthetic images, evaluating image authenticity and locating forgeries accurately while maintaining human interpretability remains a challenging task. Existing detection models primarily focus on simple authenticity classification, ultimately providing only a forgery probability or binary judgment, which offers limited explanatory insights into image authenticity. Moreover, while MLLM-based detection methods can provide more interpretable results, they still lag behind expert models in terms of pure authenticity classification accuracy. To address this, we propose DF-LLaVA, a simple yet effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first extracts latent knowledge from MLLMs and then injects it into training via prompts. This framework allows LLaVA to achieve outstanding detection accuracy exceeding expert models while still maintaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of our DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available online at: https://github.com/Eliot-Shen/DF-LLaVA.
[176] From Canopy to Ground via ForestGen3D: Learning Cross-Domain Generation of 3D Forest Structure from Aerial-to-Terrestrial LiDAR
Juan Castorena, E. Louise Loudermilk, Scott Pokswinski, Rodman Linn
Main category: cs.CV
TL;DR: ForestGen3D uses diffusion models to generate realistic sub-canopy forest structure from aerial LiDAR, enabling landscape-scale 3D reconstruction without expensive terrestrial scanning.
Details
Motivation: Accurate 3D vegetation structure measurement is critical for ecological analysis but terrestrial LiDAR is expensive and infeasible at landscape scales. Aerial LiDAR captures canopy structure but misses sub-canopy detail needed for complete ecological understanding.
Method: Conditional denoising diffusion probabilistic models trained on co-registered aerial and terrestrial LiDAR data. The model generates realistic TLS-like point clouds that preserve ALS geometry while inferring missing sub-canopy structure.
Result: Produces high-fidelity reconstructions matching TLS reference data in 3D structural similarity and biophysical metrics (tree height, DBH, crown diameter, volume). Introduces Expected Point Containment (EPC) metric for quality assessment without ground truth.
Conclusion: ForestGen3D enhances ALS-only environments by inferring ecologically plausible sub-canopy structure while preserving landscape heterogeneity, providing richer 3D representations for ecological analysis and remote sensing applications.
Abstract: The 3D structure of living and non-living components in ecosystems plays a critical role in determining ecological processes and feedbacks from both natural and human-driven disturbances. Anticipating the effects of wildfire, drought, disease, or atmospheric deposition depends on accurate characterization of 3D vegetation structure, yet widespread measurement remains prohibitively expensive and often infeasible. We present ForestGen3D, a cross-domain generative framework that preserves aerial LiDAR (ALS) observed 3D forest structure while inferring missing sub-canopy detail. ForestGen3D is based on conditional denoising diffusion probabilistic models trained on co-registered ALS and terrestrial LiDAR (TLS) data. The model generates realistic TLS-like point clouds that remain spatially consistent with ALS geometry, enabling landscape-scalable reconstruction of full vertical forest structure. We evaluate ForestGen3D at tree, plot, and landscape scales using real-world data from mixed conifer ecosystems, and show through qualitative and quantitative geometric and distributional analyses that it produces high-fidelity reconstructions closely matching TLS reference data in terms of 3D structural similarity and downstream biophysical metrics, including tree height, DBH, crown diameter, and crown volume. We further introduce and demonstrate the expected point containment (EPC) metric which serves as a practical proxy for generation quality in settings where TLS ground truth is unavailable. Our results demonstrate that ForestGen3D enhances the utility of ALS only environments by inferring ecologically plausible sub-canopy structure while faithfully preserving the landscape heterogeneity encoded in ALS observations, thereby providing a richer 3D representation for ecological analysis, structural fuel characterization and related remote sensing applications.
[177] VideoPro: Adaptive Program Reasoning for Long Video Understanding
Chenglin Li, Feng Han, Yikun Wang, Ruilin Li, Shuai Dong, Haowen Hou, Haitao Li, Qianglong Chen, Feng Tao, Jingqi Tong, Yin Zhang, Jiaqi Wang
Main category: cs.CV
TL;DR: FS-VisPR is an adaptive visual program reasoning framework that balances fast reasoning for simple queries with slow reasoning for difficult ones, improving efficiency and reliability in visual program workflows for long-form video tasks.
Details
Motivation: Previous approaches for visual program workflows rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering. There's a need for an adaptive system that can handle both simple and complex queries efficiently.
Method: 1) Design efficient visual modules for long-form video tasks; 2) Construct fast-slow reasoning dataset and train FS-LLM; 3) Implement adaptive framework: simple queries → VideoLLMs, difficult queries → visual program reasoning with fallback mechanisms; 4) Improve programs through parameter search during training and inference.
Result: FS-VisPR achieves 50.4% accuracy on LVBench (surpassing GPT-4o) and matches Qwen2.5VL-72B performance on VideoMME, demonstrating improved efficiency and reliability in visual program workflows.
Conclusion: The FS-VisPR framework successfully addresses limitations of previous approaches by providing an adaptive visual program reasoning system that balances efficiency and accuracy, making it suitable for practical long-form video question answering applications.
Abstract: Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we construct a diverse and high-quality fast-slow reasoning dataset with a strong LLM to align open-source language models’ ability to generate visual program workflows as FS-LLM. Next, we design a fast-slow reasoning framework with FS-LLM: Simple queries are directly solved by VideoLLMs, while difficult ones invoke visual program reasoning, motivated by human-like reasoning processes. During this process, low-confidence fast-thinking answers will trigger a second-stage slow-reasoning process, and a fallback mechanism to fast reasoning is activated if the program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. By adjusting the parameters of the visual modules within the program, multiple variants are generated: during training, programs that yield correct answers are selected, while during inference, the program with the highest confidence result is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o and matching the performance of Qwen2.5VL-72B on VideoMME.
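The fast-slow control flow described above (confidence-gated escalation, selection of the highest-confidence program variant at inference, and fallback to the fast answer when program execution fails) can be sketched as below. All names, the `(answer, confidence)` return convention, and the 0.7 threshold are hypothetical; the stub models stand in for a VideoLLM and for visual program variants.

```python
def fast_slow_answer(query, fast_model, program_variants, threshold=0.7):
    """Return (answer, route), where route is 'fast', 'slow', or 'fallback'.

    fast_model(query) and each program variant return (answer, confidence).
    """
    fast_answer, fast_conf = fast_model(query)
    if fast_conf >= threshold:
        return fast_answer, "fast"

    # Slow path: run every parameter variant of the visual program.
    results = []
    for run_program in program_variants:
        try:
            results.append(run_program(query))
        except Exception:
            continue  # a failed program variant is simply skipped

    if not results:
        # Every program failed: fall back to the fast answer.
        return fast_answer, "fallback"

    best_answer, _ = max(results, key=lambda r: r[1])
    return best_answer, "slow"


# Hypothetical stubs used only to exercise the control flow.
def confident_model(query):
    return ("A", 0.9)


def unsure_model(query):
    return ("B", 0.3)


def variant_low(query):
    return ("C", 0.4)


def variant_high(query):
    return ("D", 0.8)


def broken_variant(query):
    raise RuntimeError("program execution failed")
```

For example, `fast_slow_answer("q", unsure_model, [variant_low, variant_high])` takes the slow path and keeps the answer from `variant_high`.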
[178] Real-Time Object Detection Meets DINOv3
Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
Main category: cs.CV
TL;DR: DEIMv2 extends DEIM with DINOv3 features across 8 model sizes, achieving SOTA performance with better parameter efficiency from X to Atto scales.
Details
Motivation: To extend the successful DEIM framework with DINOv3 features to create a unified family of models covering diverse deployment scenarios (GPU, edge, mobile) with superior performance-cost trade-offs.
Method: For larger models (X, L, M, S): use DINOv3-pretrained/distilled backbones with Spatial Tuning Adapter (STA) to convert single-scale to multi-scale features. For ultra-lightweight models (Nano, Pico, Femto, Atto): use HGNetv2 with depth/width pruning, simplified decoder, and upgraded Dense O2O.
Result: DEIMv2-X achieves 57.8 AP with 50.3M params, surpassing prior X-scale models. DEIMv2-S (9.71M params) exceeds 50 AP milestone (50.9 AP). DEIMv2-Pico (1.5M params) delivers 38.5 AP, matching YOLOv10-Nano with ~50% fewer parameters.
Conclusion: DEIMv2 establishes new SOTA across diverse model sizes with superior parameter efficiency, covering GPU to mobile deployment scenarios through a unified design approach.
Abstract: Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3’s single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters. Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2
[179] PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang, Ming-Ching Chang, Wei-Chao Chen
Main category: cs.CV
TL;DR: PatchEAD is a unified patch-focused framework for training-free anomaly detection that works with diverse foundation models using visual prompting techniques, achieving superior few-shot and zero-shot performance without textual features.
Details
Motivation: Current industrial anomaly detection relies heavily on textual prompt tuning with foundation models, leaving visual processing fragmented and model-specific. The authors aim to create a unified visual framework that works across different foundation models without requiring textual features or extensive training.
Method: Proposes Patch-Exclusive Anomaly Detection (PatchEAD) framework with visual prompting techniques including an alignment module and foreground masking. The approach is training-free and patch-focused, enabling compatibility with diverse foundation models without textual features.
Result: Superior few-shot and batch zero-shot performance compared to prior work, despite using no textual features. The study also examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing guidance for selecting foundation models for real-world visual inspection.
Conclusion: A well-unified patch-only framework enables quick, calibration-light deployment without carefully engineered textual prompts, confirming that visual-only approaches can achieve strong anomaly detection performance with foundation models.
Abstract: Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
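To make the idea of training-free, patch-similarity scoring concrete, here is a minimal sketch of one common formulation: each query patch is scored by its cosine distance to the most similar patch from normal reference images. This is an assumed, generic scheme for illustration; the abstract does not specify PatchEAD's exact scoring, alignment module, or foreground masking, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_anomaly_scores(query_patches, reference_patches):
    """Score each query patch as 1 - (max cosine similarity to any reference patch).

    Patches that closely match some normal reference patch score near 0;
    patches unlike every reference patch score higher.
    """
    return [
        1.0 - max(cosine(q, r) for r in reference_patches)
        for q in query_patches
    ]
```

In practice the patch vectors would come from a frozen foundation-model backbone, which is why backbone choice affects the robustness of the similarity scores.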
[180] DECOR: Deep Embedding Clustering with Orientation Robustness
Fiona Victoria Stanley Jothiraj, Arunaggiri Pandian Karunanidhi, Seth A. Eichmeyer
Main category: cs.CV
TL;DR: DECOR is a deep clustering framework for wafer defect pattern analysis that handles orientation variations, complex unlabeled data, and multiple defects per wafer without manual tuning.
Details
Motivation: Early detection of wafer defects is critical for semiconductor manufacturing yield optimization, but raw wafer data is complex, unlabeled, imbalanced, and can contain multiple defects on single wafers, requiring robust clustering methods for imperfect data conditions.
Method: DECOR (Deep Embedding Clustering with Orientation Robustness) framework groups complex defect patterns from wafer maps into consistent clusters by explicitly accounting for orientation variations, ensuring spatially similar defects are clustered regardless of rotation or alignment.
Result: DECOR outperforms existing clustering baseline methods on the open source MixedWM38 dataset, demonstrating ability to discover clusters without manual tuning.
Conclusion: DECOR provides a reliable and scalable solution for automated visual inspection systems in semiconductor manufacturing by handling orientation variations and complex wafer defect data.
Abstract: In semiconductor manufacturing, early detection of wafer defects is critical for product yield optimization. However, raw wafer data from wafer quality tests are often complex, unlabeled, imbalanced and can contain multiple defects on a single wafer, making it crucial to design clustering methods that remain reliable under such imperfect data conditions. We introduce DECOR, a deep clustering with orientation robustness framework that groups complex defect patterns from wafer maps into consistent clusters. We evaluate our method on the open source MixedWM38 dataset, demonstrating its ability to discover clusters without manual tuning. DECOR explicitly accounts for orientation variations in wafer maps, ensuring that spatially similar defects are consistently clustered regardless of their rotation or alignment. Experiments indicate that our method outperforms existing clustering baseline methods, thus providing a reliable and scalable solution in automated visual inspection systems.
[181] Auditing and Mitigating Bias in Gender Classification Algorithms: A Data-Centric Approach
Tadesse K Bahiru, Natnael Tilahun Sinshaw, Teshager Hailemariam Moges, Dheeraj Kumar Singh
Main category: cs.CV
TL;DR: The paper addresses gender classification bias by creating BalancedFace, a new dataset that equalizes demographic representation across 189 intersections of age, race, and gender, reducing fairness gaps by over 50% compared to existing datasets.
Details
Motivation: Gender classification systems inherit and amplify demographic imbalances from training data. An audit of five widely used datasets revealed significant intersectional underrepresentation, leading to biased models that misclassify female faces more often than male faces and amplify racial skew.
Method: 1) Audited five gender classification datasets for intersectional representation; 2) Trained identical MobileNetV2 classifiers on UTKFace and FairFace to measure bias; 3) Created BalancedFace by blending images from FairFace and UTKFace, supplemented with other collections to fill demographic gaps; 4) Engineered the dataset to equalize subgroup shares across 189 intersections of age, race, and gender using only real, unedited images.
Result: BalancedFace reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared to the next-best dataset, with minimal loss of overall accuracy.
Conclusion: Data-centric interventions like BalancedFace are profoundly valuable for fair gender classification research. The openly available dataset demonstrates that carefully engineered training data can significantly reduce bias while maintaining accuracy, highlighting the importance of addressing data imbalances rather than just algorithmic fixes.
Abstract: Gender classification systems often inherit and amplify demographic imbalances in their training data. We first audit five widely used gender classification datasets, revealing that all suffer from significant intersectional underrepresentation. To measure the downstream impact of these flaws, we train identical MobileNetV2 classifiers on the two most balanced of these datasets, UTKFace and FairFace. Our fairness evaluation shows that even these models exhibit significant bias, misclassifying female faces at a higher rate than male faces and amplifying existing racial skew. To counter these data-induced biases, we construct BalancedFace, a new public dataset created by blending images from FairFace and UTKFace, supplemented with images from other collections to fill missing demographic gaps. It is engineered to equalize subgroup shares across 189 intersections of age, race, and gender using only real, unedited images. When a standard classifier is trained on BalancedFace, it reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared to the next-best dataset, all with a minimal loss of overall accuracy. These results underline the profound value of data-centric interventions and provide an openly available resource for fair gender classification research.
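The two fairness measures named above can be illustrated with their standard definitions: the True Positive Rate gap is the spread between the best and worst per-group TPR, and Disparate Impact is the ratio of positive-prediction rates between two groups (ideal value 1.0). The sketch below assumes binary 0/1 labels and hypothetical function names; how the paper averages Disparate Impact across subgroups is not stated in the abstract and is not reproduced here.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_on_positives) / len(preds_on_positives)


def max_tpr_gap(y_true, y_pred, groups):
    """Largest difference in TPR between any two groups."""
    rates = []
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates.append(
            true_positive_rate([y_true[i] for i in idx], [y_pred[i] for i in idx])
        )
    return max(rates) - min(rates)


def disparate_impact(y_pred, groups, unprivileged, privileged):
    """Ratio of positive-prediction rates: unprivileged / privileged."""
    def positive_rate(g):
        preds = [p for p, gi in zip(y_pred, groups) if gi == g]
        return sum(preds) / len(preds)

    return positive_rate(unprivileged) / positive_rate(privileged)


# Toy data: six samples split across two hypothetical groups "a" and "b".
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
groups = ["a", "a", "b", "b", "a", "b"]
```

On this toy data, group "a" has a TPR of 1.0 and group "b" a TPR of 0.5, so the gap is 0.5 and the Disparate Impact of "b" relative to "a" is 0.5.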
[182] Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap
Elisabeth Jüttner, Janelle Pfeifer, Leona Krath, Stefan Korfhage, Hannah Dröge, Matthias B. Hullin, Markus Plack
Main category: cs.CV
TL;DR: Hybrid framework combining diffusion material priors with temporal regularization and physical rendering for stable volumetric video relighting.
Details
Motivation: Current volumetric video relighting methods lack temporal stability for production use. Diffusion-based methods work for single frames but suffer from noise and instability in sequences, while video diffusion models face memory and scale limitations.
Method: Hybrid approach combining: 1) Diffusion-derived material priors aggregated into temporally consistent shading using optical-flow-guided regularization, and 2) Mesh proxy extraction from Gaussian Opacity Fields for indirect effects rendered in standard graphics pipeline.
Result: Achieves substantially more stable relighting across sequences than diffusion-only baselines, scales beyond feasible clip lengths for video diffusion, works on both real and synthetic captures.
Conclusion: Hybrid approaches balancing learned priors with physically grounded constraints represent a practical step toward production-ready volumetric video relighting.
Abstract: Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.
[183] PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data
Ayushi Sharma, Johanna Trost, Daniel Lusk, Johannes Dollinger, Julian Schrader, Christian Rossi, Javier Lopatin, Etienne Laliberté, Simon Haberstroh, Jana Eichel, Daniel Mederer, Jose Miguel Cerda-Paredes, Shyam S. Phartyal, Lisa-Maricia Schwarz, Anja Linstädter, Maria Conceição Caldeira, Teja Kattenborn
Main category: cs.CV
TL;DR: PlantTraitNet uses citizen science photos with deep learning to create more accurate global plant trait maps than existing methods.
Details
Motivation: Existing global plant trait maps are limited by expensive field measurements with sparse geographic coverage, while citizen science initiatives provide millions of geotagged plant photos that could overcome these limitations.
Method: PlantTraitNet is a multi-modal, multi-task uncertainty-aware deep learning framework that predicts four key plant traits (height, leaf area, specific leaf area, nitrogen content) from citizen science photos using weak supervision, then aggregates predictions to generate global trait distribution maps.
Result: PlantTraitNet consistently outperforms existing global trait products across all evaluated traits when validated against independent vegetation survey data (sPlotOpen), demonstrating more accurate trait mapping.
Conclusion: Citizen science imagery combined with computer vision and geospatial AI enables not only scalable but also more accurate global trait mapping, offering a powerful new pathway for ecological research and Earth system modeling.
Abstract: Global maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task uncertainty-aware deep learning framework that predicts four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping. This approach offers a powerful new pathway for ecological research and Earth system modeling.
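The step of aggregating per-photo trait predictions across space can be sketched as a grid-cell average. The sketch below bins predictions into latitude/longitude cells and combines them with inverse-variance weights, so that more certain predictions count more. The weighting scheme, the one-degree cell size, and the function name are assumptions for illustration; the abstract does not say how the predicted uncertainties enter the aggregation.

```python
import math
from collections import defaultdict


def aggregate_to_grid(predictions, cell_deg=1.0):
    """Inverse-variance weighted mean of trait predictions per grid cell.

    predictions: iterable of (lat, lon, value, variance) with variance > 0.
    Returns {(lat_index, lon_index): weighted_mean}.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # cell -> [sum(w * value), sum(w)]
    for lat, lon, value, variance in predictions:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        weight = 1.0 / variance
        sums[cell][0] += weight * value
        sums[cell][1] += weight
    return {
        cell: weighted_sum / total_weight
        for cell, (weighted_sum, total_weight) in sums.items()
    }
```

Each key identifies a cell by the floor of its coordinates divided by the cell size, so a photo at latitude 10.2 and longitude 20.7 lands in cell (10, 20) at one-degree resolution.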
[184] Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Segmentation
Xingyue Zhao, Wenke Huang, Xingguang Wang, Haoyu Zhao, Linghao Zhuang, Anwen Jiang, Guancheng Wan, Mang Ye
Main category: cs.CV
TL;DR: FedBCS addresses feature heterogeneity in federated medical image segmentation by aligning domain-invariant contextual prototypes across institutions, using frequency-domain style recalibration and dual-level prototype alignment.
Details
Motivation: Feature heterogeneity from diverse medical scanners/protocols in federated learning causes incomplete contextual representation learning (focusing only on final-layer features) and layerwise style bias accumulation, reducing segmentation accuracy and model robustness.
Method: Proposes FedBCS with: 1) Frequency-domain adaptive style recalibration for prototype construction that decouples content-style representations and learns optimal style parameters; 2) Context-aware dual-level prototype alignment that extracts domain-invariant prototypes from different encoder/decoder layers and fuses them with contextual information.
Result: Extensive experiments on two public datasets demonstrate remarkable performance improvements over existing methods.
Conclusion: FedBCS effectively bridges feature representation gaps in federated medical image segmentation by addressing both incomplete contextual learning and style bias accumulation through domain-invariant prototype alignment.
Abstract: Federated learning enables multiple medical institutions to train a global model without sharing data, yet feature heterogeneity from diverse scanners or protocols remains a major challenge. Many existing works attempt to address this issue by leveraging model representations (e.g., mean feature vectors) to correct local training; however, they often face two key limitations: 1) Incomplete Contextual Representation Learning: Current approaches primarily focus on final-layer features, overlooking critical multi-level cues and thus diluting essential context for accurate segmentation. 2) Layerwise Style Bias Accumulation: Although utilizing representations can partially align global features, these methods neglect domain-specific biases within intermediate layers, allowing style discrepancies to build up and reduce model robustness. To address these challenges, we propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototypes alignment. Specifically, we introduce a frequency-domain adaptive style recalibration into prototype construction that not only decouples content-style representations but also learns optimal style parameters, enabling more robust domain-invariant prototypes. Furthermore, we design a context-aware dual-level prototype alignment method that extracts domain-invariant prototypes from different layers of both encoder and decoder and fuses them with contextual information for finer-grained representation alignment. Extensive experiments on two public datasets demonstrate that our method exhibits remarkable performance.
[185] YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection
Ori Meiraz, Sharon Shalev, Avishai Weizman
Main category: cs.CV
TL;DR: Novel Mixture-of-Experts framework for object detection using adaptive routing among multiple YOLOv9-T experts to achieve better performance than single model.
Details
Motivation: To improve object detection performance by enabling dynamic feature specialization through multiple expert models rather than relying on a single model architecture.
Method: Mixture-of-Experts framework with adaptive routing mechanism that dynamically selects among multiple YOLOv9-T expert models for specialized feature processing.
Result: Achieves higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
Conclusion: The Mixture-of-Experts approach with adaptive routing effectively enhances object detection performance by leveraging multiple specialized experts rather than a single model.
Abstract: This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
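As a rough illustration of what adaptive routing among experts can look like, the sketch below uses a generic softmax gate: a linear gate scores each expert from the input features, and the expert outputs are combined with the resulting weights. This is a textbook Mixture-of-Experts formulation, not the paper's routing mechanism, which the one-sentence abstract does not describe; the names and the toy experts are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, gate_weights, experts):
    """Combine expert outputs using input-dependent softmax gate weights.

    gate_weights: one weight row per expert, same length as `features`.
    experts: callables mapping features to an output vector.
    """
    logits = [sum(w * f for w, f in zip(row, features)) for row in gate_weights]
    weights = softmax(logits)
    outputs = [expert(features) for expert in experts]
    dim = len(outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, outputs))
        for d in range(dim)
    ]
```

With equal gate scores the result is a plain average of the experts; as one expert's gate score grows, its output dominates, which is what allows per-input specialization.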
[186] Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
Noam Glazner, Noam Tsfaty, Sharon Shalev, Avishai Weizman
Main category: cs.CV
TL;DR: Proposes cluster-based frame selection to prevent information leakage in video datasets by grouping similar frames before dataset splitting.
Details
Motivation: To address information leakage in video-derived frame datasets, where similar frames from the same video end up in different splits, compromising dataset reliability and evaluation validity.
Method: Clusters visually similar frames before splitting into training, validation, and test sets, ensuring frames from the same cluster don’t get distributed across different splits.
Result: Produces more representative, balanced, and reliable dataset partitions by preventing information leakage through frame similarity.
Conclusion: Cluster-based frame selection is an effective strategy for creating better dataset splits in video analysis tasks by mitigating information leakage issues.
Abstract: We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frames datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
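The splitting step can be sketched as follows, assuming each frame already carries a cluster ID (for example from clustering image embeddings, which is not shown). Whole clusters are assigned to a single split, so near-duplicate frames never straddle train, validation, and test. The greedy assignment rule, the default ratios, and the function name are assumptions rather than the authors' procedure.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters of frames to train/val/test.

    cluster_ids[i] is the cluster ID of frame i. Returns a dict mapping
    split name to a list of frame indices.
    """
    clusters = defaultdict(list)
    for frame_idx, cid in enumerate(cluster_ids):
        clusters[cid].append(frame_idx)

    order = sorted(clusters)  # fixed starting order for reproducibility
    random.Random(seed).shuffle(order)

    names = ("train", "val", "test")
    targets = [r * len(cluster_ids) for r in ratios]
    splits = {name: [] for name in names}

    for cid in order:
        # Place the cluster in the split furthest below its target size.
        deficits = [targets[i] - len(splits[names[i]]) for i in range(len(names))]
        best = deficits.index(max(deficits))
        splits[names[best]].extend(clusters[cid])
    return splits
```

Because clusters are indivisible, the realized split sizes only approximate the requested ratios, most noticeably when a few clusters are very large.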
[187] MetaDCSeg: Robust Medical Image Segmentation via Meta Dynamic Center Weighting
Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng
Main category: cs.CV
TL;DR: MetaDCSeg is a robust medical image segmentation framework that dynamically learns pixel-wise weights to handle noisy annotations and ambiguous boundaries through a Dynamic Center Distance mechanism.
Details
Motivation: Medical image segmentation suffers from noisy annotations and ambiguous anatomical boundaries, causing training instability. Existing methods using global noise assumptions or confidence-based selection fail to adequately address performance degradation, especially in challenging boundary regions.
Method: Proposes MetaDCSeg framework that dynamically learns optimal pixel-wise weights to suppress noisy ground-truth labels while preserving reliable annotations. Uses Dynamic Center Distance (DCD) mechanism to model boundary uncertainty, employing weighted feature distances for foreground, background, and boundary centers to focus on hard-to-segment pixels near ambiguous boundaries.
Result: Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.
Conclusion: The proposed framework effectively addresses annotation noise and boundary ambiguity in medical image segmentation, enabling more precise handling of structural boundaries often overlooked by existing methods, leading to significant performance enhancement.
Abstract: Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model’s attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.
[188] TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning
Yu Chen, Hongwei Lin
Main category: cs.CV
TL;DR: TUN is a multi-modal neural network that automatically identifies significant points in persistence diagrams using enhanced descriptors, self-attention, PointNet-style encoding, and learned fusion with per-point classification.
Details
Motivation: Persistence diagrams are powerful for understanding point cloud topology, but distinguishing genuine signals from noise remains challenging, hindering practical adoption of topological data analysis in applications requiring automated interpretation for downstream decision-making.
Method: Topology Understanding Net (TUN) combines enhanced persistence diagram descriptors with self-attention mechanisms, a PointNet-style point cloud encoder, learned fusion techniques, per-point classification, stable preprocessing, and imbalance-aware training for automatic significance detection.
Result: Experiments show TUN outperforms classic methods in detecting significant points in persistence diagrams, demonstrating effectiveness in real-world applications.
Conclusion: TUN provides an automated and effective solution for identifying significant points in persistence diagrams, which is critical for downstream applications in topological data analysis.
Abstract: Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
[189] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng
Main category: cs.CV
TL;DR: VIGA is a vision-as-inverse-graphics agent that reconstructs/edits scenes through iterative execution and verification, improving one-shot baselines by 35-124% across various benchmarks.
Details
Motivation: Current vision-language models lack fine-grained spatial and physical grounding needed for one-shot vision-as-inverse-graphics tasks, requiring a more iterative approach.
Method: VIGA uses a closed-loop write-run-render-compare-revise procedure with a skill library (generator/verifier roles) and an evolving context memory (plans, code diffs, render history).
Result: Substantial improvements over one-shot baselines: 35.32% on BlenderGym, 117.17% on SlideBench, and 124.70% on new BlenderBench benchmark.
Conclusion: VIGA demonstrates that iterative multimodal reasoning with execution-verification cycles enables task- and model-agnostic vision-as-inverse-graphics capabilities across diverse applications.
Abstract: Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs cannot achieve this in one shot, as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphics Agent), which starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn’t require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we found VIGA substantially improves one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn’t require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, where VIGA improves by 124.70%.
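The closed-loop procedure described above can be pictured with a generic iterate-and-verify loop. This is only a minimal sketch under stated assumptions, not VIGA's code: the callables `generate`, `render`, and `compare`, the `memory` record layout, and the toy task (finding an integer whose "render" is twice its value) are all hypothetical stand-ins for the generator role, the graphics engine, and the verifier role.

```python
def closed_loop(target, generate, render, compare, max_iters=10, tol=0.0):
    """Generic write-run-render-compare-revise loop (illustrative only).

    generate(target, memory) -> program          # "write" / "revise"
    render(program)          -> output           # "run" + "render"
    compare(output, target)  -> (error, feedback)
    """
    memory = []   # evolving context: past programs, outputs, and feedback
    best = None
    for step in range(max_iters):
        program = generate(target, memory)
        output = render(program)
        error, feedback = compare(output, target)
        memory.append({"step": step, "program": program, "output": output,
                       "error": error, "feedback": feedback})
        if best is None or error < best["error"]:
            best = memory[-1]
        if error <= tol:
            break   # verifier is satisfied; stop revising
    return best, memory


# Toy stand-ins: the "program" is an integer and rendering doubles it.
def toy_generate(target, memory):
    if not memory:
        return 0                                   # start from an "empty world"
    last = memory[-1]
    return last["program"] + last["feedback"]      # revise using the last hint


def toy_render(program):
    return program * 2


def toy_compare(output, target):
    diff = target - output
    hint = (diff > 0) - (diff < 0)                 # -1, 0, or +1
    return abs(diff), hint
```

Calling `closed_loop(14, toy_generate, toy_render, toy_compare)` revises the toy program one step at a time until the rendered output matches the target; the returned `memory` list plays the role of the render history.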
[190] Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning
Haomiao Tang, Jinpeng Wang, Minyi Zhao, Guanghao Meng, Ruisheng Luo, Long Chen, Shu-Tao Xia
Main category: cs.CV
TL;DR: HUG introduces a heterogeneous uncertainty-guided paradigm for composed image retrieval, addressing noise in CIR triplets through fine-grained probabilistic learning with Gaussian embeddings and customized uncertainty estimation for multi-modal queries.
Details
Motivation: Intrinsic noise in CIR triplets creates uncertainty that threatens model robustness. Existing probabilistic approaches fail for CIR due to instance-level holistic modeling and homogeneous treatment of queries and targets.
Method: HUG uses fine-grained probabilistic learning with Gaussian embeddings for queries and targets. It customizes heterogeneous uncertainty estimation for multi-modal queries vs uni-modal targets, captures content quality and multi-modal coordination uncertainties, and implements dynamic weighting. Includes uncertainty-guided objectives with holistic and fine-grained contrasts and comprehensive negative sampling.
Result: Experiments on benchmarks show HUG outperforms state-of-the-art baselines, with analysis justifying technical contributions.
Conclusion: HUG effectively addresses CIR uncertainty through heterogeneous uncertainty-guided learning, demonstrating superior performance over existing approaches.
Abstract: Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model’s robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG’s effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.
[191] SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction
Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada
Main category: cs.CV
TL;DR: SUG-Occ: A sparse learning framework for efficient 3D semantic occupancy prediction using semantics and uncertainty guidance to reduce computation while maintaining accuracy.
Details
Motivation: 3D semantic occupancy prediction provides detailed scene understanding but suffers from prohibitive computation and memory overhead, making real-time deployment challenging. The inherent sparsity of 3D scenes presents an opportunity to reduce redundant computation while maintaining completeness.
Method: 1) Uses semantic and uncertainty priors to suppress free space projections during view transformation with explicit unsigned distance encoding for geometric consistency. 2) Cascade sparse completion module with hyper cross sparse convolution and generative upsampling for coarse-to-fine reasoning. 3) Object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features via lightweight query-context interactions instead of expensive attention operations.
Result: Extensive experiments on SemanticKITTI benchmark show the approach outperforms baselines with 7.34% improvement in accuracy and 57.8% gain in efficiency.
Conclusion: SUG-Occ successfully addresses the efficiency challenges of 3D semantic occupancy prediction by exploiting scene sparsity through semantics and uncertainty guidance, achieving both improved accuracy and significant computational efficiency gains suitable for real-time autonomous driving applications.
Abstract: As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Second, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features. Extensive experiments on the SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.
[192] GutenOCR: A Grounded Vision-Language Front-End for Documents
Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
Main category: cs.CV
TL;DR: GutenOCR is a family of OCR models built by fine-tuning Qwen2.5-VL vision-language models, offering unified reading, detection, and grounding capabilities through a prompt-based interface, with significant performance improvements over the base models.
Details
Motivation: To create a unified OCR system that combines reading, detection, and grounding capabilities in a single checkpoint, addressing limitations of existing OCR systems that typically separate these functions and lack grounding capabilities.
Method: Fine-tuned Qwen2.5-VL-3B and Qwen2.5-VL-7B models on business documents, scientific articles, and synthetic grounding data. The models support full-page and localized reading with bounding boxes at line and paragraph levels, plus conditional “where is x?” queries through a prompt-based interface.
Result: GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone (0.40 to 0.82) on 10.5K held-out business and scientific pages. Substantial improvements in region- and line-level OCR and text-detection recall on Fox and OmniDocBench v1.5 benchmarks, though with trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
Conclusion: GutenOCR successfully demonstrates that fine-tuning vision-language models can create effective unified OCR systems with strong grounding capabilities, though certain specialized OCR tasks remain challenging and reveal trade-offs in the approach.
Abstract: GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional “where is x?” queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
[193] StyMam: A Mamba-Based Generator for Artistic Style Transfer
Zhou Hong, Rongsheng Hu, Yicheng Di, Xiaolong Xu, Ning Dong, Yihua Shao, Run Ling, Yun Wang, Juqin Wang, Zhanjie Zhang, Ao Ma
Main category: cs.CV
TL;DR: Proposes StyMam, a Mamba-based generator for image style transfer that addresses artifacts and disharmony in GAN-based methods while preserving content structure better than diffusion models.
Details
Motivation: Current GAN-based methods struggle with capturing both local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce these issues but often fail to preserve content structures and have slow inference speeds.
Method: Revisits GAN approach with a Mamba-based generator (StyMam) featuring: 1) residual dual-path strip scanning mechanism for efficient local texture feature capture, and 2) channel-reweighted spatial attention module for modeling global dependencies.
Result: Extensive experiments show the proposed method outperforms state-of-the-art algorithms in both quality (reducing artifacts and disharmony) and speed (faster inference than SD-based methods).
Conclusion: StyMam successfully addresses limitations of both GAN and SD-based approaches by combining Mamba architecture with novel mechanisms for local and global feature modeling, achieving superior style transfer performance.
Abstract: Image style transfer aims to integrate the visual patterns of a specific artistic style into a content image while preserving its content structure. Existing methods mainly rely on the generative adversarial network (GAN) or stable diffusion (SD). GAN-based approaches using CNNs or Transformers struggle to jointly capture local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce such issues but often fail to preserve content structures and suffer from slow inference. To address these issues, we revisit GAN and propose a Mamba-based generator, termed StyMam, to produce high-quality stylized images without introducing artifacts and disharmonious patterns. Specifically, we introduce a Mamba-based generator with a residual dual-path strip scanning mechanism and a channel-reweighted spatial attention module. The former efficiently captures local texture features, while the latter models global dependencies. Finally, extensive qualitative and quantitative experiments demonstrate that the proposed method outperforms state-of-the-art algorithms in both quality and speed.
[194] GO-MLVTON: Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models
Yang Yu, Yunze Deng, Yige Zhang, Yanjie Xiao, Youkun Ou, Wenhao Hu, Mingchao Li, Bin Feng, Wenyu Liu, Dandan Zheng, Jingdong Chen
Main category: cs.CV
TL;DR: GO-MLVTON is the first multi-layer virtual try-on method that addresses occlusion relationships between inner and outer garments using Garment Occlusion Learning and StableDiffusion-based garment deformation.
Details
Motivation: Existing VTON methods focus on single-layer or multi-garment try-on but neglect multi-layer VTON, which requires realistic deformation and layering of multiple garment layers with accurate occlusion modeling.
Method: Proposes GO-MLVTON with two key modules: 1) Garment Occlusion Learning module to learn occlusion relationships between inner and outer garments, and 2) StableDiffusion-based Garment Morphing & Fitting module to deform and fit garments onto the human body.
Result: The method achieves state-of-the-art performance, produces high-quality multi-layer try-on results, and introduces the MLG dataset and LACD metric for evaluation.
Conclusion: GO-MLVTON successfully addresses the multi-layer VTON challenge by modeling garment occlusion relationships and using diffusion-based garment fitting, establishing a new benchmark for multi-layer virtual try-on.
Abstract: Existing image-based virtual try-on (VTON) methods primarily focus on single-layer or multi-garment VTON, neglecting multi-layer VTON (ML-VTON), which involves dressing multiple layers of garments onto the human body with realistic deformation and layering to generate visually plausible outcomes. The main challenge lies in accurately modeling occlusion relationships between inner and outer garments to reduce interference from redundant inner garment features. To address this, we propose GO-MLVTON, the first multi-layer VTON method, introducing the Garment Occlusion Learning module to learn occlusion relationships and the StableDiffusion-based Garment Morphing & Fitting module to deform and fit garments onto the human body, producing high-quality multi-layer try-on results. Additionally, we present the MLG dataset for this task and propose a new metric named Layered Appearance Coherence Difference (LACD) for evaluation. Extensive experiments demonstrate the state-of-the-art performance of GO-MLVTON. Project page: https://upyuyang.github.io/go-mlvton/.
[195] Scribble-Supervised Medical Image Segmentation with Dynamic Teacher Switching and Hierarchical Consistency
Thanh-Huy Nguyen, Hoang-Loc Cao, Dat T. Chung, Mai-Anh Vu, Thanh-Minh Nguyen, Minh Le, Phat K. Huynh, Ulas Bagci
Main category: cs.CV
TL;DR: SDT-Net: A dual-teacher, single-student framework for scribble-supervised medical image segmentation that uses dynamic teacher switching and multi-level supervision to overcome annotation sparsity and boundary ambiguity.
Details
Motivation: Scribble-supervised methods reduce annotation burden but suffer from sparse annotations causing ambiguity, noisy pseudo-label propagation, and poor anatomical boundary learning in medical image segmentation.
Method: Proposes SDT-Net with Dynamic Teacher Switching (DTS) to adaptively select reliable teacher, Pick Reliable Pixels (PRP) for high-confidence pseudo-label refinement, and Hierarchical Consistency (HiCo) module for multi-level feature alignment between teacher and student.
Result: Achieves state-of-the-art performance on ACDC and MSCMRseg datasets, producing more accurate and anatomically plausible segmentation compared to existing methods.
Conclusion: SDT-Net effectively addresses scribble annotation limitations through adaptive teacher selection and multi-level supervision, demonstrating superior segmentation quality for medical images with minimal annotation burden.
Abstract: Scribble-supervised methods have emerged to mitigate the prohibitive annotation burden in medical image segmentation. However, the inherent sparsity of these annotations introduces significant ambiguity, which results in noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries. To address this challenge, we propose SDT-Net, a novel dual-teacher, single-student framework designed to maximize supervision quality from these weak signals. Our method features a Dynamic Teacher Switching (DTS) module to adaptively select the most reliable teacher. This selected teacher then guides the student via two synergistic mechanisms: high-confidence pseudo-labels, refined by a Pick Reliable Pixels (PRP) mechanism, and multi-level feature alignment, enforced by a Hierarchical Consistency (HiCo) module. Extensive experiments on the ACDC and MSCMRseg datasets demonstrate that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentation.
cs.AI
[196] Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models
Alfred Shen, Aaron Shen
Main category: cs.AI
TL;DR: Gated Sparse Attention (GSA) combines sparse attention efficiency with gated attention stability, achieving 12-16x speedup at 128K context while improving perplexity from 6.03 to 5.70 and reducing attention sink phenomena.
Details
Motivation: Address computational burden of attention in long-context language models by combining strengths of two independent approaches: sparse attention (reduces complexity) and gated attention (improves training stability, mitigates attention sink phenomenon).
Method: Propose Gated Sparse Attention (GSA) with three key components: 1) gated lightning indexer with sigmoid activations for bounded, interpretable selection scores, 2) adaptive sparsity controller that modulates attended tokens based on local uncertainty, and 3) dual gating at value and output stages.
Result: With 1.7B parameter models trained on 400B tokens: matches sparse-only efficiency (12-16x speedup at 128K context), achieves quality gains (perplexity improves from 6.03 to 5.70), RULER scores nearly double at 128K context, attention to first token drops from 47% to under 4%, training stability improves with 98% reduction in loss spikes.
Conclusion: GSA successfully combines complementary benefits of sparse and gated attention, providing efficient long-context modeling with improved quality, reduced attention sinks, and enhanced training stability, supported by theoretical foundations including complexity analysis and convergence guarantees.
Abstract: The computational burden of attention in long-context language models has motivated two largely independent lines of work: sparse attention mechanisms that reduce complexity by attending to selected tokens, and gated attention variants that improve training stability while mitigating the attention sink phenomenon. We observe that these approaches address complementary weaknesses and propose Gated Sparse Attention (GSA), an architecture that realizes the benefits of both. GSA incorporates a gated lightning indexer with sigmoid activations that produce bounded, interpretable selection scores, an adaptive sparsity controller that modulates the number of attended tokens based on local uncertainty, and dual gating at the value and output stages. We establish theoretical foundations for the approach, including complexity analysis, expressiveness results, and convergence guarantees. In experiments with 1.7B parameter models trained on 400B tokens, GSA matches the efficiency of sparse-only baselines (12-16x speedup at 128K context) while achieving the quality gains associated with gated attention: perplexity improves from 6.03 to 5.70, RULER scores at 128K context nearly double, and attention to the first token, a proxy for attention sinks, drops from 47% to under 4%. Training stability improves markedly, with loss spikes reduced by 98%.
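To make the combination of sparse selection and gating concrete, here is a toy single-query sketch in plain Python. It is not the GSA architecture: the separate indexer vectors (`index_q`, `index_keys`), the fixed `k`, and the scalar `gate_logit` are assumptions for illustration, and the adaptive sparsity controller and the value-stage gate mentioned in the abstract are deliberately left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_sparse_attention(q, keys, values, index_q, index_keys, gate_logit, k):
    """Toy single-query attention: sigmoid-scored top-k selection, then a gated output.

    Assumes 1 <= k <= len(keys). Returns (output, selected_positions, scores).
    """
    # 1) Indexer: bounded selection scores in (0, 1) for every key position.
    scores = [sigmoid(dot(index_q, ik)) for ik in index_keys]

    # 2) Sparsity: keep only the k highest-scoring positions.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # 3) Scaled dot-product attention restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    m = max(logits)                                # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w / z * values[i][d] for w, i in zip(weights, selected))
           for d in range(dim)]

    # 4) Output gate: a sigmoid scalar in (0, 1) scales the attention output.
    g = sigmoid(gate_logit)
    return [g * o for o in out], selected, scores
```

The cost of step 3 scales with `k` rather than with the full context length, which is where the efficiency of sparse attention comes from; the sigmoid in step 1 keeps selection scores bounded, unlike a softmax that forces scores to compete.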
[197] Uncovering Latent Bias in LLM-Based Emergency Department Triage Through Proxy Variables
Ethan Zhang
Main category: cs.AI
TL;DR: LLMs in emergency triage show hidden biases through proxy variables, modifying perceived patient severity based on specific tokens regardless of framing, revealing imperfect training on noisy signals.
Details
Motivation: Despite advances in LLM integration into clinical decision-making, hidden biases against patients across racial, social, economic, and clinical backgrounds persist, requiring investigation of bias in LLM-based medical AI systems for emergency department triage.
Method: Used 32 patient-level proxy variables with paired positive/negative qualifiers, evaluated on both public (MIMIC-IV-ED Demo, MIMIC-IV Demo) and restricted-access credentialed (MIMIC-IV-ED, MIMIC-IV) datasets to assess bias in LLM-based emergency triage systems.
Result: Found discriminatory behavior mediated through proxy variables in ED triage, and systematic tendency for LLMs to modify perceived patient severity when specific tokens appear in input context, regardless of positive/negative framing.
Conclusion: AI systems are imperfectly trained on noisy, non-causal signals that don’t reliably reflect true patient acuity, requiring more work to ensure safe and responsible deployment of AI in clinical settings.
Abstract: Recent advances in large language models (LLMs) have enabled their integration into clinical decision-making; however, hidden biases against patients across racial, social, economic, and clinical backgrounds persist. In this study, we investigate bias in LLM-based medical AI systems applied to emergency department (ED) triage. We employ 32 patient-level proxy variables, each represented by paired positive and negative qualifiers, and evaluate their effects using both public (MIMIC-IV-ED Demo, MIMIC-IV Demo) and restricted-access credentialed (MIMIC-IV-ED and MIMIC-IV) datasets as appropriate. Our results reveal discriminatory behavior mediated through proxy variables in ED triage scenarios, as well as a systematic tendency for LLMs to modify perceived patient severity when specific tokens appear in the input context, regardless of whether they are framed positively or negatively. These findings indicate that AI systems are still imperfectly trained on noisy, sometimes non-causal signals that do not reliably reflect true patient acuity. Consequently, more needs to be done to ensure the safe and responsible deployment of AI technologies in clinical settings.
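The paired-qualifier design can be sketched as a small probing harness. This is a hypothetical illustration rather than the paper's protocol: `proxy_shifts`, the example phrases, and `toy_score` (a keyword-based stand-in for an LLM triage call) are all assumptions, and no real patient data or model is involved.

```python
def proxy_shifts(base_note, proxies, score_fn):
    """Compare severity scores with positive and negative qualifiers against the plain note.

    proxies:  dict mapping proxy name -> (positive_phrase, negative_phrase)
    score_fn: callable(str) -> numeric severity (stand-in for an LLM triage call)
    """
    baseline = score_fn(base_note)
    report = {}
    for name, (pos, neg) in proxies.items():
        d_pos = score_fn(f"{base_note} {pos}") - baseline
        d_neg = score_fn(f"{base_note} {neg}") - baseline
        report[name] = {
            "pos_shift": d_pos,
            "neg_shift": d_neg,
            # Shifts in the same direction suggest sensitivity to the token itself
            # rather than to its positive or negative framing.
            "framing_independent": d_pos * d_neg > 0,
        }
    return report


def toy_score(text):
    """Deliberately biased toy scorer: reacts to keywords, not to clinical content."""
    score = 3
    if "insurance" in text:
        score += 1
    if "unhoused" in text:
        score += 1
    return score


toy_proxies = {
    "insurance": ("Patient has insurance.", "Patient has no insurance."),
    "housing": ("Patient has stable housing.", "Patient is unhoused."),
}
```

With the toy scorer, the `insurance` proxy shifts the score in the same direction under both framings, which is the kind of token-level sensitivity the abstract describes, while the `housing` proxy only shifts under its negative framing.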
[198] DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey
Guo-Biao Zhang, Ding-Yuan Liu, Da-Yi Wu, Tian Lan, Heyan Huang, Zhijing Wu, Xian-Ling Mao
Main category: cs.AI
TL;DR: DeepSurvey-Bench is a new benchmark for evaluating the academic value of AI-generated scientific surveys, addressing limitations of existing benchmarks that focus only on surface-level metrics.
Details
Motivation: Existing benchmarks for evaluating generated scientific surveys are flawed because they use unreliable ground truth datasets (selected by citation counts/structural coherence) and focus only on surface-level metrics like structural quality, failing to assess deep academic value such as research objectives and critical analysis.
Method: Proposed DeepSurvey-Bench with comprehensive academic value evaluation criteria covering three dimensions: informational value, scholarly communication value, and research guidance value. Constructed a reliable dataset with academic value annotations to evaluate deep academic value of generated surveys.
Result: Extensive experiments show the benchmark is highly consistent with human performance in assessing academic value of generated surveys.
Conclusion: DeepSurvey-Bench provides a more comprehensive and reliable way to evaluate the true academic value of AI-generated scientific surveys, addressing critical gaps in existing evaluation approaches.
Abstract: The rapid development of automated scientific survey generation technology has made it increasingly important to establish a comprehensive benchmark to evaluate the quality of generated surveys. Nearly all existing evaluation benchmarks rely on flawed selection criteria such as citation counts and structural coherence to select human-written surveys as the ground truth survey datasets, and then use surface-level metrics such as structural quality and reference relevance to evaluate generated surveys. However, these benchmarks have two key issues: (1) the ground truth survey datasets are unreliable because of a lack of academic dimension annotations; (2) the evaluation metrics only focus on the surface quality of the survey, such as logical coherence. As a result, existing benchmarks cannot assess the deep “academic value” of generated surveys, such as the core research objectives and the critical analysis of different studies. To address the above problems, we propose DeepSurvey-Bench, a novel benchmark designed to comprehensively evaluate the academic value of generated surveys. Specifically, our benchmark proposes comprehensive academic value evaluation criteria covering three dimensions: informational value, scholarly communication value, and research guidance value. Based on these criteria, we construct a reliable dataset with academic value annotations, and evaluate the deep academic value of the generated surveys. Extensive experimental results demonstrate that our benchmark is highly consistent with human performance in assessing the academic value of generated surveys.
[199] MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation
Chandan Kumar Sahu, Premith Kumar Chilukuri, Matthew Hetrich
Main category: cs.AI
TL;DR: MiRAGE is a multiagent framework that automatically generates verified, domain-specific, multimodal, multi-hop QA datasets for evaluating RAG systems, addressing the lack of specialized benchmarks for complex enterprise applications.
Details
Motivation: Existing RAG evaluation benchmarks are inadequate for multimodal, high-stakes enterprise applications because they rely on general-domain corpora or purely textual retrieval, failing to capture the complexity of specialized technical documents where information is multimodal and reasoning requires synthesizing disjoint evidence.
Method: MiRAGE uses a collaborative swarm of specialized agents: 1) recursive context optimization loop to aggregate scattered evidence, 2) adversarial verifier agent to guarantee factual grounding, and 3) persona/domain recognition agent to mimic expert cognitive workflows, generating verified multimodal QA datasets.
Result: MiRAGE generates datasets with significantly higher reasoning complexity (>2.3 average hops) and factual faithfulness across four domains (regulations, finance, quantitative biology, journalism). Ablation studies show it works with LLMs if image descriptions are available, though visual grounding remains challenging.
Conclusion: MiRAGE automates creation of gold standard evaluation datasets reflecting proprietary corpora’s latent thematic structure, providing necessary infrastructure to rigorously benchmark next-generation multimodal RAG systems for enterprise applications.
Abstract: The rapid evolution of Retrieval-Augmented Generation (RAG) toward multimodal, high-stakes enterprise applications has outpaced the development of domain-specific evaluation benchmarks. Existing datasets often rely on general-domain corpora or purely textual retrieval, failing to capture the complexity of specialized technical documents where information is inextricably multimodal and reasoning requires synthesizing disjoint evidence. We address this gap by introducing MiRAGE, a Multiagent framework for RAG systems Evaluation, that leverages a collaborative swarm of specialized agents to generate verified, domain-specific, multimodal, and multi-hop Question-Answer datasets. MiRAGE orchestrates a swarm of specialized agents: a recursive context optimization loop to aggregate scattered evidence, an adversarial verifier agent to guarantee factual grounding, and an agent to recognize the expert persona and the relevant domain to mimic expert cognitive workflows. Extensive empirical evaluation across four distinct domains (regulations, finance, quantitative biology, and journalism) demonstrates that MiRAGE generates datasets with significantly higher reasoning complexity (>2.3 average hops) and factual faithfulness. Our ablation studies indicate that MiRAGE can be powered by LLMs if textual descriptions of the images are available. Visual grounding remains a frontier. By automating the creation of gold standard evaluation datasets that reflect the latent thematic structure of proprietary corpora, MiRAGE provides the necessary infrastructure to rigorously benchmark the next generation of information retrieval systems.
[200] Aeon: High-Performance Neuro-Symbolic Memory Management for Long-Horizon LLM Agents
Mustafa Arslan
Main category: cs.AI
TL;DR: Aeon is a neuro-symbolic cognitive OS that solves LLM memory limitations with structured memory management, achieving sub-millisecond retrieval while maintaining state consistency for autonomous agents.
Details
Motivation: LLMs face quadratic computational costs in self-attention and "Lost in the Middle" degradation with long contexts. Current "Flat RAG" approaches treat memory as unstructured embeddings, causing "Vector Haze" where retrieved facts lack episodic continuity and hierarchical structure.
Method: Aeon structures memory into: 1) Memory Palace (spatial index via Atlas - SIMD-accelerated Page-Clustered Vector Index combining small-world graphs with B+ Tree disk locality), 2) Trace (neuro-symbolic episodic graph), and 3) Semantic Lookaside Buffer (predictive caching exploiting conversational locality). Uses zero-copy C++/Python bridge for state consistency.
Result: Achieves <1ms retrieval latency on conversational workloads while ensuring state consistency. Effectively enables persistent, structured memory for autonomous agents by minimizing read amplification and maintaining episodic continuity.
Conclusion: Aeon redefines memory as a managed OS resource rather than static storage, solving LLM memory limitations through neuro-symbolic structuring and predictive caching, enabling efficient long-horizon interactions for autonomous systems.
Abstract: Large Language Models (LLMs) are fundamentally constrained by the quadratic computational cost of self-attention and the “Lost in the Middle” phenomenon, where reasoning capabilities degrade as context windows expand. Existing solutions, primarily “Flat RAG” architectures relying on vector databases, treat memory as an unstructured bag of embeddings. This approach fails to capture the hierarchical and temporal structure of long-horizon interactions, leading to “Vector Haze”, the retrieval of disjointed facts lacking episodic continuity. We propose Aeon, a Neuro-Symbolic Cognitive Operating System that redefines memory not as a static store, but as a managed OS resource. Aeon structures memory into a Memory Palace (a spatial index implemented via Atlas, a SIMD-accelerated Page-Clustered Vector Index that combines small-world graph navigation with B+ Tree-style disk locality to minimize read amplification) and a Trace (a neuro-symbolic episodic graph). We introduce the Semantic Lookaside Buffer (SLB), a predictive caching mechanism that exploits conversational locality to achieve sub-millisecond retrieval latencies. Benchmarks demonstrate that Aeon achieves < 1ms retrieval latency on conversational workloads while ensuring state consistency via a zero-copy C++/Python bridge, effectively enabling persistent, structured memory for autonomous agents.
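The lookaside idea behind the Semantic Lookaside Buffer can be illustrated with a small sketch. The abstract does not describe the SLB's internals, so everything here is an assumption: the `LookasideBuffer` class, the cosine-similarity threshold, the FIFO eviction, and the `slow_search` callback standing in for the full index are all hypothetical, and the sketch covers only the cache-before-index lookup, not any predictive prefetching.

```python
from collections import deque
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class LookasideBuffer:
    """Hypothetical cache checked before a slower index search."""

    def __init__(self, slow_search, capacity=4, threshold=0.95):
        self.slow_search = slow_search          # fallback: the full (slow) index lookup
        self.entries = deque(maxlen=capacity)   # (embedding, result); oldest entry is evicted first
        self.threshold = threshold              # assumed similarity cut-off for a cache hit
        self.hits = 0
        self.misses = 0

    def lookup(self, embedding):
        # Conversational locality: a new query is often close to a recent one.
        for cached_embedding, result in self.entries:
            if cosine(embedding, cached_embedding) >= self.threshold:
                self.hits += 1
                return result
        # Miss: pay for the slow search once, then remember the answer.
        self.misses += 1
        result = self.slow_search(embedding)
        self.entries.append((tuple(embedding), result))
        return result
```

With a buffer like this, a follow-up query whose embedding is nearly parallel to a recent one is answered from the buffer, and only queries on a new topic reach the slow index.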
[201] ALIGNAgent: Adaptive Learner Intelligence for Gap Identification and Next-step guidance
Bismack Tokoli, Luis Jaimes, Ayesha S. Dina
Main category: cs.AI
TL;DR: ALIGNAgent is a multi-agent educational framework that integrates knowledge estimation, skill-gap identification, and personalized resource recommendation into a cohesive adaptive learning cycle.
Details
Motivation: Existing personalized learning systems are fragmented, specializing in either knowledge tracing, diagnostic modeling, or resource recommendation separately, but rarely integrating these components into a unified adaptive cycle for comprehensive personalized education.
Method: ALIGNAgent uses a multi-agent framework with: 1) Skill Gap Agent that processes student quiz performance, gradebook data, and preferences to generate topic-level proficiency estimates using concept-level diagnostic reasoning; 2) Recommender Agent that retrieves preference-aware learning materials aligned with diagnosed deficiencies; and 3) a continuous feedback loop where interventions occur before advancing to subsequent topics.
Result: Extensive evaluation on authentic datasets from two undergraduate computer science courses shows GPT-4o-based agents achieving precision of 0.87-0.90 and F1 scores of 0.84-0.87 in knowledge proficiency estimation, validated against actual exam performance.
Conclusion: ALIGNAgent effectively integrates multiple personalized learning components into a cohesive adaptive framework, demonstrating strong performance in knowledge proficiency estimation and providing a comprehensive solution for personalized education.
Abstract: Personalized learning systems have emerged as a promising approach to enhance student outcomes by tailoring educational content, pacing, and feedback to individual needs. However, most existing systems remain fragmented, specializing in either knowledge tracing, diagnostic modeling, or resource recommendation, but rarely integrating these components into a cohesive adaptive cycle. In this paper, we propose ALIGNAgent (Adaptive Learner Intelligence for Gap Identification and Next-step guidance), a multi-agent educational framework designed to deliver personalized learning through integrated knowledge estimation, skill-gap identification, and targeted resource recommendation. ALIGNAgent begins by processing student quiz performance, gradebook data, and learner preferences to generate topic-level proficiency estimates using a Skill Gap Agent that employs concept-level diagnostic reasoning to identify specific misconceptions and knowledge deficiencies. After identifying skill gaps, the Recommender Agent retrieves preference-aware learning materials aligned with diagnosed deficiencies, implementing a continuous feedback loop where interventions occur before advancing to subsequent topics. Extensive empirical evaluation on authentic datasets from two undergraduate computer science courses demonstrates ALIGNAgent’s effectiveness, with GPT-4o-based agents achieving precision of 0.87-0.90 and F1 scores of 0.84-0.87 in knowledge proficiency estimation validated against actual exam performance.
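For readers less familiar with the reported metrics, the following sketch shows how precision and F1 are computed when predicted proficiency labels are compared with exam outcomes. The function name and the boolean encoding (True for "proficient") are assumptions made for illustration; the abstract does not describe the paper's evaluation code or labeling scheme.

```python
def precision_recall_f1(predicted, actual):
    """Compare predicted proficiency (True/False) with exam-based ground truth."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because F1 combines precision with recall, an F1 slightly below the precision, as in the reported ranges, indicates that recall is somewhat lower than precision.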
[202] The Paradigm Shift: A Comprehensive Survey on Large Vision Language Models for Multimodal Fake News Detection
Wei Ai, Yilong Tan, Yuntao Shou, Tao Meng, Haowen Chen, Zhixiong He, Keqin Li
Main category: cs.AI
TL;DR: This paper provides the first comprehensive survey on multimodal fake news detection using large vision-language models, tracing the paradigm shift from traditional feature engineering to unified end-to-end frameworks.
Details
Motivation: The rapid evolution of LVLMs has transformed multimodal fake news detection, but there's no systematic survey documenting this transition and consolidating recent developments. The field needs a comprehensive review to map the evolution and guide future research.
Method: The survey provides a historical perspective mapping evolution from conventional pipelines to foundation model-driven paradigms, establishes a structured taxonomy covering model architectures, datasets, and performance benchmarks, and analyzes technical challenges.
Result: This is the first comprehensive survey to systematically document and analyze the transformative role of LVLMs in combating multimodal fake news, providing a structured framework for understanding the field’s evolution and current state.
Conclusion: The survey outlines future research directions to guide the next stage of this paradigm shift, addressing challenges like interpretability, temporal reasoning, and domain generalization in multimodal fake news detection using LVLMs.
Abstract: In recent years, the rapid evolution of large vision-language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND), transforming it from traditional feature-engineering approaches to unified, end-to-end multimodal reasoning frameworks. Early methods primarily relied on shallow fusion techniques to capture correlations between text and images, but they struggled with high-level semantic understanding and complex cross-modal interactions. The emergence of LVLMs has fundamentally changed this landscape by enabling joint modeling of vision and language with powerful representation learning, thereby enhancing the ability to detect misinformation that leverages both textual narratives and visual content. Despite these advances, the field lacks a systematic survey that traces this transition and consolidates recent developments. To address this gap, this paper provides a comprehensive review of MFND through the lens of LVLMs. We first present a historical perspective, mapping the evolution from conventional multimodal detection pipelines to foundation model-driven paradigms. Next, we establish a structured taxonomy covering model architectures, datasets, and performance benchmarks. Furthermore, we analyze the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. Finally, we outline future research directions to guide the next stage of this paradigm shift. To the best of our knowledge, this is the first comprehensive survey to systematically document and analyze the transformative role of LVLMs in combating multimodal fake news. A summary of the existing methods discussed is available on GitHub: https://github.com/Tan-YiLong/Overview-of-Fake-News-Detection.
[203] Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents
Raffi Khatchadourian
Main category: cs.AI
TL;DR: DFAH framework measures LLM agent determinism and faithfulness for financial audit compliance, finding positive correlation between consistency and evidence-alignment.
Details
Motivation: LLM agents in financial services fail to produce consistent results when reproducing flagged transaction decisions, creating regulatory audit compliance issues.
Method: Determinism-Faithfulness Assurance Harness (DFAH) framework for measuring trajectory determinism and evidence-conditioned faithfulness across 74 configurations (12 models, 4 providers) with non-agentic baselines and agentic tool-use experiments.
Result: 7-20B parameter models achieved 100% determinism, while 120B+ models needed 3.7x larger validation samples. Positive correlation (r=0.45) between determinism and faithfulness. Tier 1 models with schema-first architectures met audit replay requirements.
Conclusion: DFAH enables reliable measurement of LLM agent determinism for financial regulatory compliance, with schema-first architectures showing best performance for audit replay requirements.
Abstract: LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using agents deployed in financial services. Across 74 configurations (12 models, 4 providers, 8-24 runs each at T=0.0) in non-agentic baseline experiments, 7-20B parameter models achieved 100% determinism, while 120B+ models required 3.7x larger validation samples to achieve equivalent statistical reliability. Agentic tool-use introduces additional variance (see Tables 4-7). Contrary to the assumed reliability-capability trade-off, a positive Pearson correlation emerged (r = 0.45, p < 0.01, n = 51 at T=0.0) between determinism and faithfulness; models producing consistent outputs also tended to be more evidence-aligned. Three financial benchmarks are provided (compliance triage, portfolio constraints, DataOps exceptions; 50 cases each) along with an open-source stress-test harness. In these benchmarks and under DFAH evaluation settings, Tier 1 models with schema-first architectures achieved determinism levels consistent with audit replay requirements.
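To make the two reported quantities concrete, here is a minimal sketch of a replay-determinism rate and a Pearson correlation. The abstract does not give DFAH's exact definition of determinism, so `determinism_rate` (the share of repeated runs that agree with the most common output) is an assumed stand-in, and both function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def determinism_rate(outputs):
    """Share of repeated runs (identical inputs) that match the most common output."""
    counts = Counter(outputs)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(outputs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

Under this assumed definition, a configuration that returns the same decision on every replay scores 1.0, and feeding per-configuration determinism and faithfulness scores into `pearson_r` yields the kind of coefficient the abstract reports.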
[204] Beyond Prompting: Efficient and Robust Contextual Biasing for Speech LLMs via Logit-Space Integration (LOGIC)
Peidong Wang
Main category: cs.AI
TL;DR: LOGIC is a decoding-layer framework that efficiently integrates contextual entity knowledge into Speech LLMs without prompting limitations, achieving 9% WER reduction with minimal false alarms.
Details
Motivation: Speech LLMs struggle with recognizing new, domain-specific entities (names, playlists, jargon) due to static training. Prompting solutions have scalability issues (context window limits, latency, lost-in-the-middle), while GEC approaches suffer from over-correction hallucinations.
Method: LOGIC (Logit-Space Integration for Contextual Biasing) operates directly in the decoding layer, decoupling context injection from input processing. This ensures constant-time complexity regardless of prompt length, unlike prompting approaches.
Result: Extensive experiments with Phi-4-MM model across 11 multilingual locales show LOGIC achieves average 9% relative reduction in Entity Word Error Rate with only 0.30% increase in False Alarm Rate.
Conclusion: LOGIC provides an efficient, robust solution for contextual biasing in Speech LLMs that overcomes scalability limitations of prompting while avoiding over-correction issues of generative error correction methods.
Abstract: The rapid emergence of new entities – driven by cultural shifts, evolving trends, and personalized user data – poses a significant challenge for existing Speech Large Language Models (Speech LLMs). While these models excel at general conversational tasks, their static training knowledge limits their ability to recognize domain-specific terms such as contact names, playlists, or technical jargon. Existing solutions primarily rely on prompting, which suffers from poor scalability: as the entity list grows, prompting encounters context window limitations, increased inference latency, and the “lost-in-the-middle” phenomenon. An alternative approach, Generative Error Correction (GEC), attempts to rewrite transcripts via post-processing but frequently suffers from “over-correction”, introducing hallucinations of entities that were never spoken. In this work, we introduce LOGIC (Logit-Space Integration for Contextual Biasing), an efficient and robust framework that operates directly in the decoding layer. Unlike prompting, LOGIC decouples context injection from input processing, ensuring constant-time complexity relative to prompt length. Extensive experiments using the Phi-4-MM model across 11 multilingual locales demonstrate that LOGIC achieves an average 9% relative reduction in Entity WER with a negligible 0.30% increase in False Alarm Rate.
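The general idea of biasing in logit space, rather than in the prompt, can be sketched as follows. This is not the LOGIC algorithm, whose details the abstract does not give; it is a generic trie-based biasing scheme in which tokens that start or continue a listed entity receive a fixed bonus at each decoding step. The `TrieBiaser` class, the `bonus` value, and the example tokens are all hypothetical.

```python
class TrieBiaser:
    """Hypothetical logit biaser: boosts tokens that start or extend a listed entity."""

    def __init__(self, entities, bonus=2.0):
        self.root = {}
        for tokens in entities:           # each entity is a sequence of tokens
            node = self.root
            for token in tokens:
                node = node.setdefault(token, {})
        self.bonus = bonus
        self.active = []                  # trie nodes reached by partial entity matches

    def advance(self, token):
        """Update the partial matches after `token` has been emitted."""
        candidates = self.active + [self.root]
        self.active = [node[token] for node in candidates if token in node]

    def apply(self, logits):
        """Return a copy of `logits` (token -> score) with the bonus added to entity tokens."""
        boosted = set(self.root)          # tokens that can start an entity
        for node in self.active:
            boosted.update(node)          # tokens that continue a partial match
        return {tok: score + (self.bonus if tok in boosted else 0.0)
                for tok, score in logits.items()}
```

The work per decoding step depends on the number of live partial matches, not on how many entities are in the list or on any prompt length, which is the property that makes decoding-time biasing attractive compared with prompting. A real system would also need to control the bonus carefully, since boosting entity tokens too aggressively is one way false alarms arise.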
[205] Prometheus Mind: Retrofitting Memory to Frozen Language Models
Mark Wind
Main category: cs.AI
TL;DR: Prometheus Mind adds memory to frozen Qwen3-4B using modular adapters (7% overhead) that are fully reversible, solving four key problems: semantic direction discovery, stage-wise training, weight-based injection, and hidden state collapse recovery.
Details
Motivation: Adding memory to pretrained language models typically requires architectural changes or weight modification, which can be intrusive and irreversible. The authors aim to develop a reversible memory system that can be added to frozen models without permanent changes.
Method: Uses 11 modular adapters (530MB total) added to frozen Qwen3-4B. Solves four problems: (1) Contrastive Direction Discovery (CDD) for semantic direction extraction without labeled data, (2) Stage-wise training of each adapter on simple proxy tasks, (3) Using existing lm_head.weight rows for injection without training, (4) Training projections to recover distinction from collapsed hidden states.
Result: Achieves 94.4% retrieval accuracy on clean inputs (n=54, 95% CI: [84.9%, 98.1%]) on PrometheusExtract-132 dataset. Performance degrades to 19.4% on informal inputs with ellipsis, filler words, or implicit subjects. Primary bottleneck is relation classification (47.3% accuracy), responsible for most extraction errors.
Conclusion: Prometheus Mind demonstrates that reversible memory can be added to frozen language models with minimal overhead (7%), though performance is limited by relation classification accuracy and degrades significantly with informal language. The system successfully addresses key technical challenges in memory retrofitting.
Abstract: Adding memory to pretrained language models typically requires architectural changes or weight modification. We present Prometheus Mind, which retrofits memory to a frozen Qwen3-4B using 11 modular adapters (530MB, 7% overhead) – fully reversible by removing the adapters. Building this system required solving four problems: (1) Extraction – we develop Contrastive Direction Discovery (CDD), which finds semantic directions via minimal pairs without labeled data. (2) Training – end-to-end optimization collapses; stage-wise training of each adapter on simple proxy tasks succeeds. (3) Injection – learned encoders fail to generalize; we find that lm_head.weight rows already provide the mapping we need, requiring no training. (4) Hidden state collapse – transformers make “wife” and “brother” 0.98+ similar; we train projections to recover distinction (0.98 → 0.09). On PrometheusExtract-132 (132 cases), the system achieves 94.4% retrieval on clean inputs (n=54, 95% CI: [84.9%, 98.1%]), degrading to 19.4% on informal inputs with ellipsis, filler words, or implicit subjects (n=36). The primary bottleneck is relation classification (47.3% accuracy), responsible for most extraction errors.
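The reported confidence interval can be sanity-checked with a short calculation. The abstract does not say which interval method was used or give the raw success count; the sketch below assumes 51 successes out of 54 (the count that rounds to 94.4%) and a Wilson score interval, both of which are inferences rather than stated facts.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: 51 of 54 clean inputs retrieved correctly (51/54 = 94.4%).
low, high = wilson_interval(51, 54)
print(f"{low:.1%} to {high:.1%}")  # prints "84.9% to 98.1%"
```

Under these assumptions the result matches the interval in the abstract, and it also shows how wide the uncertainty remains with only 54 cases.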
[206] Logic Programming on Knowledge Graph Networks And its Application in Medical Domain
Chuanqing Wang, Zhenmin Zhao, Shanshan Du, Chaoqun Fei, Songmao Zhang, Ruqian Lu
Main category: cs.AI
TL;DR: This paper introduces “knowledge graph network” as a systematic framework to address limitations in current knowledge graph applications, particularly in medicine and healthcare, by incorporating advanced reasoning, AI techniques, and multi-graph cooperation/competition methods.
Details
Motivation: Current knowledge graph research has advanced rapidly but lacks sufficient application of advanced techniques like logic reasoning, AI methods, probabilistic theories, and especially multi-graph cooperation/competition approaches in medical domains.
Method: Develops a systematic theory of “knowledge graph network” covering definition, development, reasoning, computing, and application under various conditions (unsharp, uncertain, multi-modal, vectorized, distributed, federated), with real data examples and experiments.
Result: The paper provides comprehensive framework with examples and experimental results for knowledge graph networks applied to medical and healthcare domains under different challenging conditions.
Conclusion: Presents an innovative systematic approach to knowledge graph networks that addresses current limitations and enables more sophisticated applications in medicine and healthcare through advanced reasoning and multi-graph techniques.
Abstract: The rapid development of knowledge graph research has been a major driving force for its application in many areas, including the medicine and healthcare domain. However, we have found that the application of some major information processing techniques to knowledge graphs still lags behind. This shortcoming includes the failure to make sufficient use of advanced logic reasoning, advanced artificial intelligence techniques, special-purpose programming languages, modern probabilistic and statistical theories, etc. in knowledge graph development and application. In particular, techniques for cooperation and competition among multiple knowledge graphs have not received enough attention from researchers. This paper develops a systematic theory and technique for the concept of a ‘knowledge graph network’ and its application in the medical and healthcare domain. Our research covers its definition, development, reasoning, computing and application under different conditions such as unsharp, uncertain, multi-modal, vectorized, distributed, and federated. In almost every case we provide (real data) examples and experimental results. Finally, a summary of the innovations is provided.
[207] GeMM-GAN: A Multimodal Generative Model Conditioned on Histopathology Images and Clinical Descriptions for Gene Expression Profile Generation
Francesca Pia Panaccione, Carlo Sgaravatti, Pietro Pinoli
Main category: cs.AI
TL;DR: GeMM-GAN is a GAN that generates realistic gene expression profiles from histopathology images and clinical metadata, outperforming existing methods by 11% on disease prediction tasks.
Details
Motivation: Gene expression data is valuable for biomedical research but difficult to obtain due to privacy regulations and high costs, while medical images and clinical metadata are more readily available. There's a need to generate synthetic gene expression data from these accessible modalities.
Method: GeMM-GAN uses a Transformer Encoder for histopathology image patches with a Cross Attention mechanism between image patches and text tokens (clinical metadata) to create a conditioning vector. This vector guides a generative model to produce biologically coherent gene expression profiles.
Result: The model outperforms standard generative models on the TCGA dataset, generating more realistic and functionally meaningful gene expression profiles. It improves disease type prediction accuracy by over 11% compared to state-of-the-art generative models.
Conclusion: GeMM-GAN successfully addresses the challenge of generating synthetic gene expression data from accessible medical images and clinical metadata, enabling broader biomedical research applications while overcoming privacy and cost barriers.
Abstract: Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM-GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM-GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving by more than 11% the accuracy on downstream disease type prediction compared to current state-of-the-art generative models. Code will be available at: https://github.com/francescapia/GeMM-GAN
[208] TruthTensor: Evaluating LLMs through Human Imitation on Prediction Market under Drift and Holistic Reasoning
Shirin Shahabi, Spencer Graham, Haruna Isah
Main category: cs.AI
TL;DR: TruthTensor is a novel evaluation framework that assesses LLMs as human-imitation systems in real-world, high-entropy environments using live prediction markets, going beyond static benchmarks to measure calibration, drift, and risk-sensitivity.
Details
Motivation: Current language model evaluation is inadequate because static benchmarks fail to capture real-world uncertainty, distribution shift, and the gap between isolated task accuracy and human-aligned decision-making in evolving conditions.
Method: TruthTensor uses forward-looking, contamination-free tasks anchored to live prediction markets, combining probabilistic scoring with drift-centric diagnostics, robustness checks, and clear human vs. automated evaluation roles with statistical testing procedures.
Result: Experiments across 500+ real markets show models with similar forecast accuracy can diverge markedly in calibration, drift, and risk-sensitivity, demonstrating the need for multi-axis evaluation (accuracy, calibration, narrative stability, cost, resource efficiency).
Conclusion: TruthTensor operationalizes modern evaluation best practices with clear hypothesis framing, careful metric selection, transparent reporting, human-in-the-loop validation, and open versioned evaluation contracts to produce defensible assessments of LLMs in real-world decision contexts.
Abstract: Evaluating language models and AI agents remains fundamentally challenging because static benchmarks fail to capture real-world uncertainty, distribution shift, and the gap between isolated task accuracy and human-aligned decision-making under evolving conditions. This paper introduces TruthTensor, a novel, reproducible evaluation paradigm that measures reasoning models not only as prediction engines but as human-imitation systems operating in socially-grounded, high-entropy environments. Building on forward-looking, contamination-free tasks, our framework anchors evaluation to live prediction markets and combines probabilistic scoring to provide a holistic view of model behavior. TruthTensor complements traditional correctness metrics with drift-centric diagnostics and explicit robustness checks for reproducibility. It specifies human vs. automated evaluation roles, annotation protocols, and statistical testing procedures to ensure interpretability and replicability of results. In experiments across 500+ real markets (political, economic, cultural, technological), TruthTensor demonstrates that models with similar forecast accuracy can diverge markedly in calibration, drift, and risk-sensitivity, underscoring the need to evaluate models along multiple axes (accuracy, calibration, narrative stability, cost, and resource efficiency). TruthTensor therefore operationalizes modern evaluation best practices (clear hypothesis framing, careful metric selection, transparent compute/cost reporting, human-in-the-loop validation, and open, versioned evaluation contracts) to produce defensible assessments of LLMs in real-world decision contexts. We publicly release TruthTensor at https://truthtensor.com.
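The claim that models with similar accuracy can differ in probabilistic quality is easy to illustrate. The abstract does not name TruthTensor's scoring rule, so the Brier score below is an assumed example of probabilistic scoring, and the forecasts and outcomes are invented for illustration.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def accuracy(probabilities, outcomes, threshold=0.5):
    """Share of markets where the thresholded forecast matches the outcome."""
    hits = sum((p >= threshold) == bool(o) for p, o in zip(probabilities, outcomes))
    return hits / len(outcomes)


# Invented forecasts for four resolved markets (1 = event happened).
outcomes = [1, 1, 0, 0]
confident_model = [0.9, 0.8, 0.2, 0.1]
hedging_model = [0.6, 0.55, 0.45, 0.4]
```

Both invented models pick the right side of every market, so their accuracy is identical, yet the confident model has a much lower Brier score. This is the kind of gap that accuracy alone hides.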
[209] Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models
Shahar Ben Natan, Oren Tsur
Main category: cs.AI
TL;DR: Researchers propose a novel zero-sum game framework using LLM-as-a-judge to evaluate sycophancy in LLMs, finding that while all tested models show sycophantic tendencies, Claude and Mistral exhibit “moral remorse” when sycophancy harms others, and all models show recency bias that interacts with sycophancy.
Details
Motivation: Prior works on evaluating LLM sycophancy often contain uncontrolled bias, noise, or manipulative language in prompts. There's a need for a more direct and neutral evaluation method that avoids these issues and better captures the complex dynamics of sycophantic behavior.
Method: The authors propose a novel framework treating sycophancy evaluation as a zero-sum game in a bet setting using LLM-as-a-judge. This approach makes sycophancy serve one individual while explicitly incurring cost on another. They test four leading models: Gemini 2.5 Pro, ChatGPT 4o, Mistral-Large-Instruct-2411, and Claude Sonnet 3.7.
Result: All models exhibit sycophantic tendencies in self-serving scenarios, but Claude and Mistral show “moral remorse” and over-compensate when sycophancy harms a third party. All models display recency bias toward the last-proposed answer. Crucially, sycophancy and recency bias interact to produce a “constructive interference” effect where agreement with the user is exacerbated when the user’s opinion is presented last.
Conclusion: The proposed zero-sum game framework provides a more neutral way to evaluate LLM sycophancy. The findings reveal complex behavioral patterns where sycophancy interacts with other biases like recency, and some models exhibit ethical considerations when their sycophancy harms others, suggesting varying levels of “moral” awareness across different LLMs.
Abstract: We propose a novel way to evaluate sycophancy of LLMs in a direct and neutral way, mitigating various forms of uncontrolled bias, noise, or manipulative language deliberately injected into prompts in prior works. A key novelty in our approach is the use of LLM-as-a-judge and the evaluation of sycophancy as a zero-sum game in a bet setting. Under this framework, sycophancy serves one individual (the user) while explicitly incurring cost on another. Comparing four leading models - Gemini 2.5 Pro, ChatGPT 4o, Mistral-Large-Instruct-2411, and Claude Sonnet 3.7 - we find that while all models exhibit sycophantic tendencies in the common setting, in which sycophancy is self-serving to the user and incurs no cost on others, Claude and Mistral exhibit “moral remorse” and over-compensate for their sycophancy when it explicitly harms a third party. Additionally, we observe that all models are biased toward the answer proposed last. Crucially, we find that these two phenomena are not independent; sycophancy and recency bias interact to produce a “constructive interference” effect, where the tendency to agree with the user is exacerbated when the user’s opinion is presented last.
[210] A tensor network formalism for neuro-symbolic AI
Alex Goessmann, Janina Schütte, Maximilian Fröhlich, Martin Eigel
Main category: cs.AI
TL;DR: Tensor network formalism unifies neural and symbolic AI by representing logical formulas and probability distributions as structured tensor decompositions, enabling hybrid logical-probabilistic models.
Details
Motivation: The unification of neural and symbolic approaches to artificial intelligence remains a central open challenge.
Method: Introduces a tensor network formalism with a basis encoding scheme for functions, models neural decompositions as tensor decompositions, and identifies tensor network contractions as a fundamental inference class, with reasoning algorithms formulated as contraction message passing schemes.
Result: Develops the Hybrid Logic Network for hybrid logical and probabilistic models, accompanied by the Python library tnreason for practical implementation.
Conclusion: Tensor networks provide a unified framework for neural-symbolic AI integration, enabling efficient reasoning algorithms and hybrid model training.
Abstract: The unification of neural and symbolic approaches to artificial intelligence remains a central open challenge. In this work, we introduce a tensor network formalism, which captures sparsity principles originating in the different approaches in tensor decompositions. In particular, we describe a basis encoding scheme for functions and model neural decompositions as tensor decompositions. The proposed formalism can be applied to represent logical formulas and probability distributions as structured tensor decompositions. This unified treatment identifies tensor network contractions as a fundamental inference class and formulates efficiently scaling reasoning algorithms, originating from probability theory and propositional logic, as contraction message passing schemes. The framework enables the definition and training of hybrid logical and probabilistic models, which we call Hybrid Logic Network. The theoretical concepts are accompanied by the Python library tnreason, which enables the implementation and practical use of the proposed architectures.
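Illustrative sketch (not from the paper, and not the tnreason API): the idea of representing logical formulas as tensors and treating inference as contraction can be shown in plain Python. Each formula becomes a 0/1 table over its variables, and contracting the network counts the assignments that satisfy every formula at once. The contraction here is brute-force summation, not the message passing schemes the abstract describes, and the helper names `encode` and `contract_all` are hypothetical.

```python
from itertools import product


def encode(formula, variables):
    """Encode a Boolean formula as (variables, table), with table[assignment] in {0, 1}."""
    table = {
        bits: int(bool(formula(*bits)))
        for bits in product((0, 1), repeat=len(variables))
    }
    return variables, table


def contract_all(tensors):
    """Contract the whole network to a scalar.

    Sums, over every joint assignment of all variables, the product of the
    matching entry of each tensor. For 0/1 tensors this is the number of
    assignments that satisfy all formulas simultaneously.
    """
    all_vars = sorted({v for variables, _ in tensors for v in variables})
    total = 0
    for bits in product((0, 1), repeat=len(all_vars)):
        env = dict(zip(all_vars, bits))
        term = 1
        for variables, table in tensors:
            term *= table[tuple(env[v] for v in variables)]
        total += term
    return total


# Premises: a -> b and b -> c.
implies_ab = encode(lambda a, b: (not a) or b, ("a", "b"))
implies_bc = encode(lambda b, c: (not b) or c, ("b", "c"))
model_count = contract_all([implies_ab, implies_bc])

# Negated conclusion: not (a -> c), i.e. a and not c.
negated_ac = encode(lambda a, c: a and not c, ("a", "c"))
counterexamples = contract_all([implies_ab, implies_bc, negated_ac])
```

A contraction value of zero once the negated conclusion is added means no counterexample exists, so the premises entail the conclusion.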
[211] Reliability by design: quantifying and eliminating fabrication risk in LLMs. From generative to consultative AI: a comparative analysis in the legal domain and lessons for high-stakes knowledge bases
Alex Dantart
Main category: cs.AI
TL;DR: Advanced RAG systems with end-to-end optimization can reduce legal hallucinations to negligible levels (<0.2%), making LLMs reliable for high-stakes legal work, while standalone models are unsuitable (FCR >30%).
Details
Motivation: To make large language models reliable for high-stakes legal work by addressing the critical problem of hallucinations, which is particularly dangerous in legal contexts where accuracy and verifiability are essential.
Method: The authors distinguish three AI paradigms: (1) standalone generative models, (2) basic retrieval-augmented systems, and (3) advanced end-to-end optimized RAG systems. They introduce two reliability metrics (False Citation Rate and Fabricated Fact Rate) and evaluate 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert double-blind review.
Result: Standalone models are unsuitable for professional use (FCR above 30%). Basic RAG greatly reduces errors but still has notable misgrounding. Advanced RAG with techniques like embedding fine-tuning, re-ranking, and self-correction reduces fabrication to negligible levels (below 0.2%).
Conclusion: Trustworthy legal AI requires rigor-focused, retrieval-based architectures that emphasize verification and traceability. The study provides an evaluation framework applicable to other high-risk domains beyond law.
Abstract: This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models (“creative oracle”), (2) basic retrieval-augmented systems (“expert archivist”), and (3) an advanced, end-to-end optimized RAG system (“rigorous archivist”). The authors introduce two reliability metrics, False Citation Rate (FCR) and Fabricated Fact Rate (FFR), and evaluate 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert, double-blind review. Results show that standalone models are unsuitable for professional use (FCR above 30%), while basic RAG greatly reduces errors but still leaves notable misgrounding. Advanced RAG, using techniques such as embedding fine-tuning, re-ranking, and self-correction, reduces fabrication to negligible levels (below 0.2%). The study concludes that trustworthy legal AI requires rigor-focused, retrieval-based architectures emphasizing verification and traceability, and provides an evaluation framework applicable to other high-risk domains.
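Illustrative sketch: the summary does not define FCR and FFR precisely, so the code below assumes an answer-level definition, in which each rate is the share of reviewed answers that the expert flagged for at least one false citation or fabricated fact. The `ReviewedAnswer` record and the toy data are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReviewedAnswer:
    """One expert-reviewed answer (hypothetical record layout)."""
    has_false_citation: bool
    has_fabricated_fact: bool


def rate(flags):
    """Share of True flags; assumes at least one reviewed answer."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reviewed answers")
    return sum(flags) / len(flags)


def false_citation_rate(answers):
    # Assumed definition: fraction of answers with at least one false citation.
    return rate(a.has_false_citation for a in answers)


def fabricated_fact_rate(answers):
    # Assumed definition: fraction of answers with at least one fabricated fact.
    return rate(a.has_fabricated_fact for a in answers)


# Toy review sheet, for illustration only.
reviews = [
    ReviewedAnswer(True, False),
    ReviewedAnswer(False, False),
    ReviewedAnswer(True, True),
    ReviewedAnswer(False, False),
]
fcr = false_citation_rate(reviews)
ffr = fabricated_fact_rate(reviews)
```

Under this assumed definition, the "above 30%" threshold reported for standalone models would correspond to `fcr > 0.30`.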
[212] Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge
Yiyang Feng, Zeming Chen, Haotian Wu, Jiawei Zhou, Antoine Bosselut
Main category: cs.AI
TL;DR: TRACK is a new benchmark for testing how LLMs handle conflicting knowledge during multi-step reasoning, showing that providing updated facts can actually worsen performance compared to no updates.
Details
Motivation: Current benchmarks for knowledge conflicts in LLMs focus only on single knowledge updates and fact recall, without evaluating how these updates affect downstream reasoning. There's a need to study how LLMs propagate new knowledge through multi-step reasoning when it conflicts with their parametric knowledge.
Method: Introduces the TRACK benchmark with three reasoning-intensive scenarios (WIKI, CODE, and MATH) that introduce multiple, realistic conflicts to mirror real-world complexity. Tests how LLMs handle conflicting knowledge during multi-step reasoning.
Result: Providing updated facts to models for reasoning can worsen performance compared to providing no updated facts, and this degradation grows as more updated facts are provided. The failure stems from both an inability to faithfully integrate updated facts and flawed reasoning even when knowledge is integrated.
Conclusion: TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning, addressing limitations of existing benchmarks that focus only on single updates and fact recall.
Abstract: A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce knowledge conflicts when the knowledge update fails to overwrite the model’s parametric knowledge, which propagate to faulty reasoning. Current benchmarks for this problem largely focus only on single knowledge updates and fact recall without evaluating how these updates affect downstream reasoning. In this work, we introduce TRACK (Testing Reasoning Amid Conflicting Knowledge), a new benchmark for studying how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model’s initial parametric knowledge. Spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH), TRACK introduces multiple, realistic conflicts to mirror real-world complexity. Our results on TRACK reveal that providing updated facts to models for reasoning can worsen performance compared to providing no updated facts to a model, and that this performance degradation grows as more updated facts are provided. We show this failure stems from both an inability to faithfully integrate updated facts and flawed reasoning even when knowledge is integrated. TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning.
[213] The Dark Side of AI Transformers: Sentiment Polarization & the Loss of Business Neutrality by NLP Transformers
Prasanna Kumar
Main category: cs.AI
TL;DR: Transformer-based sentiment analysis improves accuracy but causes polarization and loss of neutrality, creating reliability issues for industrial applications.
Details
Motivation: The paper addresses a critical problem in Applied AI Analytics: while transformers improve sentiment analysis accuracy, they inadvertently cause polarization of sentiment classes and fail to maintain neutrality, which is essential for reliable industrial applications.
Method: The research appears to be based on experimental observations of transformer-based sentiment analysis models, examining how accuracy improvements in one sentiment class come at the cost of polarization in other classes and loss of neutral sentiment detection.
Result: Experiments reveal that transformer-led accuracy improvements in sentiment analysis lead to polarization of sentiment classes and failure to properly identify neutral sentiments, creating reliability problems for industry applications.
Conclusion: The dark side of transformer-based sentiment analysis is the trade-off between improved accuracy and loss of neutrality/polarization, which poses significant challenges for reliable industrial NLP applications that depend on balanced sentiment outputs.
Abstract: The use of Transfer Learning & Transformers has steadily improved accuracy and has significantly contributed to solving complex computation problems. However, this transformer-led accuracy improvement in Applied AI Analytics, specifically in sentiment analytics, comes with a dark side. It is observed during experiments that many of these improvements in transformer-led accuracy for one class of sentiment have come at the cost of polarization of another class of sentiment and the failure of neutrality. This lack of neutrality poses an acute problem in the Applied NLP space, which relies heavily on the computational outputs of sentiment analytics for reliable industry-ready tasks.
[214] TransportAgents: a multi-agents LLM framework for traffic accident severity prediction
Zhichao Yang, Jiashu He, Jinxuan Fan, Cirillo Cinzia
Main category: cs.AI
TL;DR: TransportAgents: A hybrid multi-agent LLM framework that integrates specialized agents for different traffic data categories with an MLP fusion module to improve crash severity prediction accuracy and reliability.
Details
Motivation: Single-agent LLMs struggle with heterogeneous domain-specific crash data and tend to produce biased/unstable predictions for traffic crash severity, which is critical for emergency response and public safety planning.
Method: Proposes TransportAgents - a hybrid multi-agent framework with specialized LLM agents focusing on different traffic information subsets (demographics, environmental context, incident details) whose intermediate assessments are fused by a multilayer perceptron (MLP) integration module.
Result: Outperforms traditional ML and advanced LLM baselines on CPSRMS and NEISS datasets; shows strong robustness, scalability, and cross-dataset generalizability across GPT-3.5, GPT-4o, and LLaMA-3.3 backbones; produces more balanced and well-calibrated severity predictions.
Conclusion: TransportAgents offers interpretable and reliable decision support for safety-critical applications by addressing limitations of single-agent LLMs through specialized multi-agent reasoning and MLP-based fusion.
Abstract: Accurate prediction of traffic crash severity is critical for improving emergency response and public safety planning. Although recent large language models (LLMs) exhibit strong reasoning capabilities, their single-agent architectures often struggle with heterogeneous, domain-specific crash data and tend to generate biased or unstable predictions. To address these limitations, this paper proposes TransportAgents, a hybrid multi-agent framework that integrates category-specific LLM reasoning with a multilayer perceptron (MLP) integration module. Each specialized agent focuses on a particular subset of traffic information, such as demographics, environmental context, or incident details, to produce intermediate severity assessments that are subsequently fused into a unified prediction. Extensive experiments on two complementary U.S. datasets, the Consumer Product Safety Risk Management System (CPSRMS) and the National Electronic Injury Surveillance System (NEISS), demonstrate that TransportAgents consistently outperforms both traditional machine learning and advanced LLM-based baselines. Across three representative backbones, including closed-source models such as GPT-3.5 and GPT-4o, as well as open-source models such as LLaMA-3.3, the framework exhibits strong robustness, scalability, and cross-dataset generalizability. A supplementary distributional analysis further shows that TransportAgents produces more balanced and well-calibrated severity predictions than standard single-agent LLM approaches, highlighting its interpretability and reliability for safety-critical decision support applications.
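Illustrative sketch: the fusion step described above, where per-category agent assessments are combined by an MLP into one severity prediction, might look roughly like the forward pass below. The agent scores are hard-coded stand-ins for LLM outputs, and the layer sizes, weights, and severity labels are hypothetical placeholders, not values from the paper. In the real system the weights would be learned.

```python
import math

SEVERITY = ["minor", "moderate", "severe"]  # hypothetical label set


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(agent_scores, w1, b1, w2, b2):
    """One-hidden-layer MLP that maps agent scores to class probabilities."""
    hidden = relu(linear(agent_scores, w1, b1))
    return softmax(linear(hidden, w2, b2))


# Stand-ins for the intermediate assessments of three specialized agents
# (demographics, environmental context, incident details), each in [0, 1].
agent_scores = [0.2, 0.7, 0.9]

# Placeholder weights: 3 inputs -> 2 hidden units -> 3 severity classes.
w1 = [[0.5, -0.3, 0.8], [-0.4, 0.9, 0.1]]
b1 = [0.0, 0.1]
w2 = [[-1.0, -0.5], [0.2, 0.4], [1.0, 0.6]]
b2 = [0.3, 0.0, -0.3]

probs = fuse(agent_scores, w1, b1, w2, b2)
predicted = SEVERITY[probs.index(max(probs))]
```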
[215] From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models
Zhikang Chen, Tingting Zhu
Main category: cs.AI
TL;DR: Current world models focus too much on visual realism while failing at understanding physical dynamics and causal relationships. The paper argues for reframing world models as actionable simulators with causal structure, domain constraints, and long-term stability.
Details
Motivation: Current world models suffer from “visual conflation”: the mistaken belief that high-fidelity video generation equals understanding of physical and causal dynamics. The authors show that while modern models excel at pixel prediction, they violate invariant constraints, fail under intervention, and break down in safety-critical decision-making.
Method: The paper presents a survey and proposes a reframing of world models as actionable simulators rather than visual engines. The proposed approach emphasizes: 1) structured 4D interfaces, 2) constraint-aware dynamics, and 3) closed-loop evaluation. The authors use medical decision-making as an “epistemic stress test” to demonstrate their arguments.
Result: The paper demonstrates that visual realism is an unreliable proxy for world understanding. Effective world models must encode causal structure, respect domain-specific constraints, and remain stable over long horizons. The value of a world model is determined by its ability to support counterfactual reasoning, intervention planning, and robust long-horizon foresight, not by how realistic its rollouts appear.
Conclusion: World models should be reframed as actionable simulators rather than visual engines. The focus should shift from visual realism to causal understanding, constraint awareness, and long-term stability, especially for safety-critical applications like medical decision-making where trial-and-error is impossible and errors are irreversible.
Abstract: A world model is an AI system that simulates how an environment evolves under actions, enabling planning through imagined futures rather than reactive perception. Current world models, however, suffer from visual conflation: the mistaken assumption that high-fidelity video generation implies an understanding of physical and causal dynamics. We show that while modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety-critical decision-making. This survey argues that visual realism is an unreliable proxy for world understanding. Instead, effective world models must encode causal structure, respect domain-specific constraints, and remain stable over long horizons. We propose a reframing of world models as actionable simulators rather than visual engines, emphasizing structured 4D interfaces, constraint-aware dynamics, and closed-loop evaluation. Using medical decision-making as an epistemic stress test, where trial-and-error is impossible and errors are irreversible, we demonstrate that a world model’s value is determined not by how realistic its rollouts appear, but by its ability to support counterfactual reasoning, intervention planning, and robust long-horizon foresight.
[216] Autonomous Business System via Neuro-symbolic AI
Cecil Pang, Hiroki Sayama
Main category: cs.AI
TL;DR: AUTOBUS is a neuro-symbolic AI system that combines LLM-based agents with predicate logic programming to automate and orchestrate complex business initiatives by modeling tasks with explicit conditions and leveraging enterprise knowledge graphs.
Details
Motivation: Current enterprise systems are siloed and rigid, making it difficult to reconfigure cross-functional processes, while LLMs lack deterministic execution of complex business logic. There's a need to bridge this gap between flexible AI interpretation and verifiable business process execution.
Method: AUTOBUS integrates LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data in a neuro-symbolic architecture. It models initiatives as task networks with explicit conditions, organizes enterprise data as a knowledge graph translated into logic facts, and uses AI agents to synthesize task-specific logic programs executed by a logic engine.
Result: The paper introduces the AUTOBUS architecture that enables deterministic, verifiable execution of complex business initiatives while maintaining human oversight for accountability and adaptability. It provides a framework for orchestrating end-to-end business processes through AI-generated logic programs.
Conclusion: AUTOBUS successfully bridges the gap between flexible AI interpretation and deterministic business execution by combining neuro-symbolic AI approaches, enabling organizations to dynamically reconfigure cross-functional processes while maintaining human oversight and enterprise semantics.
Abstract: Current business environments require organizations to continuously reconfigure cross-functional processes, yet enterprise systems are still organized around siloed departments, rigid workflows, and hard-coded automation. Meanwhile, large language models (LLMs) excel at interpreting natural language and unstructured data but lack deterministic, verifiable execution of complex business logic. To address this gap, here we introduce AUTOBUS, an Autonomous Business System that integrates LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data into a coherent neuro-symbolic AI architecture for orchestrating end-to-end business initiatives. AUTOBUS models an initiative as a network of tasks with explicit pre/post conditions, required data, evaluation rules, and API-level actions. Enterprise data is organized as a knowledge graph whose entities, relationships, and constraints are translated into logic facts and foundational rules, providing the semantic grounding for task reasoning. Core AI agents synthesize task instructions, enterprise semantics, and available tools into task-specific logic programs, which are executed by a logic engine that enforces constraints, coordinates auxiliary tools, and orchestrates execution of actions and outcomes. Humans define and maintain the semantics, policies and task instructions, curate tools, and supervise high-impact or ambiguous decisions, ensuring accountability and adaptability. We detail the AUTOBUS architecture, the anatomy of the AI-agent-generated logic programs, and the role of humans and auxiliary tools in the lifecycle of a business initiative.
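Illustrative sketch: the notion of an initiative as a network of tasks with explicit pre/post conditions over logic facts can be mimicked in a few lines of Python. This is a plain-Python stand-in under stated assumptions, not the AUTOBUS logic engine or its generated logic programs. The `Task` record, the `run` loop, and the order-fulfilment facts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A task fires once all of its precondition facts hold (hypothetical model)."""
    name: str
    pre: frozenset
    post: frozenset


def run(tasks, initial_facts):
    """Fire every task whose preconditions hold, until nothing more can fire.

    Returns the execution order and the final fact base. Execution is
    deterministic: tasks are tried in list order on each pass.
    """
    facts = set(initial_facts)
    pending = list(tasks)
    executed = []
    progressed = True
    while pending and progressed:
        progressed = False
        for task in list(pending):
            if task.pre <= facts:          # all preconditions are known facts
                facts |= task.post         # assert the postconditions
                executed.append(task.name)
                pending.remove(task)
                progressed = True
    return executed, facts


# Facts are tuples in a predicate-like form, e.g. ("paid", "o1").
tasks = [
    Task("ship_order",
         frozenset({("paid", "o1"), ("in_stock", "o1")}),
         frozenset({("shipped", "o1")})),
    Task("check_payment", frozenset({("order", "o1")}), frozenset({("paid", "o1")})),
    Task("check_stock", frozenset({("order", "o1")}), frozenset({("in_stock", "o1")})),
]
order, final_facts = run(tasks, {("order", "o1")})
```

Although `ship_order` is listed first, it runs last because its preconditions only become true after the two checks have fired.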
[217] CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models
Haibo Tong, Zeyang Yue, Feifei Zhao, Erliang Lin, Lu Jia, Ruolin Chen, Yinqian Sun, Qian Zhang, Yi Zeng
Main category: cs.AI
TL;DR: CogToM is a comprehensive bilingual benchmark with 8000+ instances across 46 paradigms to evaluate LLMs’ Theory of Mind capabilities, revealing performance gaps and cognitive divergences from humans.
Details
Motivation: Existing benchmarks for evaluating Theory of Mind in LLMs are too narrow, focusing mainly on false belief tasks and failing to capture the full spectrum of human cognitive mechanisms.
Method: Created the CogToM benchmark with over 8000 bilingual instances across 46 paradigms, validated by 49 human annotators. Systematically evaluated 22 representative models including GPT-5.1 and Qwen3-Max.
Result: Revealed significant performance heterogeneities among models and persistent bottlenecks in specific dimensions. Analysis suggests potential divergences between LLM and human cognitive structures.
Conclusion: CogToM provides a robust instrument and perspective for investigating the evolving cognitive boundaries of LLMs, offering more comprehensive evaluation of Theory of Mind capabilities.
Abstract: Whether Large Language Models (LLMs) truly possess human-like Theory of Mind (ToM) capabilities has garnered increasing attention. However, existing benchmarks remain largely restricted to narrow paradigms like false belief tasks, failing to capture the full spectrum of human cognitive mechanisms. We introduce CogToM, a comprehensive, theoretically grounded benchmark comprising over 8000 bilingual instances across 46 paradigms, validated by 49 human annotators. A systematic evaluation of 22 representative models, including frontier models like GPT-5.1 and Qwen3-Max, reveals significant performance heterogeneities and highlights persistent bottlenecks in specific dimensions. Further analysis based on human cognitive patterns suggests potential divergences between LLM and human cognitive structures. CogToM offers a robust instrument and perspective for investigating the evolving cognitive boundaries of LLMs.
[218] Agentic AI Governance and Lifecycle Management in Healthcare
Chandra Prakash, Mary Lind, Avneesh Sisodia
Main category: cs.AI
TL;DR: UALM blueprint for managing AI agent sprawl in healthcare with five control layers and maturity model for audit-ready oversight.
Details
Motivation: Healthcare faces agent sprawl from embedded AI agents across departments/vendors, causing duplication, unclear accountability, inconsistent controls, and persistent permissions beyond original use cases. Existing governance frameworks lack guidance for day-to-day agent fleet operations.
Method: Propose Unified Agent Lifecycle Management (UALM) blueprint derived from rapid synthesis of governance standards, agent security literature, and healthcare compliance requirements. Maps gaps to five control-plane layers: identity/persona registry, orchestration/mediation, PHI-bounded context/memory, runtime policy enforcement with kill-switches, and lifecycle management with credential revocation/audit logging. Includes companion maturity model for staged adoption.
Result: UALM provides implementable pattern for audit-ready oversight that preserves local innovation while enabling safer scaling across clinical/administrative domains.
Conclusion: UALM offers healthcare leaders (CIOs, CISOs, clinical leaders) practical framework to manage AI agent sprawl with structured controls and staged implementation approach.
Abstract: Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.
[219] Predictive Coding and Information Bottleneck for Hallucination Detection in Large Language Models
Manish Bhatt
Main category: cs.AI
TL;DR: A hybrid hallucination detection framework combining neuroscience-inspired signals (Predictive Coding and Information Bottleneck) with supervised ML achieves 0.8669 AUROC with 75x less data and 1000x faster inference than existing methods.
Details
Motivation: Hallucinations in LLMs remain a critical barrier to deployment. Current detection methods are computationally expensive (external retrieval loops) or require large opaque LLM judges (70B+ parameters). There's a need for efficient, interpretable detection methods.
Method: Hybrid detection framework combining neuroscience-inspired signal design with supervised ML. Extracts interpretable signals from Predictive Coding (quantifying surprise against internal priors) and Information Bottleneck (measuring signal retention under perturbation). Key enhancements: Entity-Focused Uptake, Context Adherence, and Falsifiability Score.
Result: Achieves 0.8669 AUROC on HaluBench (n=200), representing 4.95% gain over baseline. Uses 75x less training data than Lynx (200 vs 15,000 samples), 1000x faster inference (5ms vs 5s), and remains fully interpretable. Negative finding: Rationalization signal fails to distinguish hallucinations.
Conclusion: Domain knowledge encoded in signal architecture provides superior data efficiency compared to scaling LLM judges. Achieves strong performance with lightweight (<1M parameter), explainable models suitable for production deployment, demonstrating that theory-guided approaches can outperform brute-force scaling.
Abstract: Hallucinations in Large Language Models (LLMs) – generations that are plausible but factually unfaithful – remain a critical barrier to high-stakes deployment. Current detection methods typically rely on computationally expensive external retrieval loops or opaque black-box LLM judges requiring 70B+ parameters. In this work, we introduce [Model Name], a hybrid detection framework that combines neuroscience-inspired signal design with supervised machine learning. We extract interpretable signals grounded in Predictive Coding (quantifying surprise against internal priors) and the Information Bottleneck (measuring signal retention under perturbation). Through systematic ablation, we demonstrate three key enhancements: Entity-Focused Uptake (concentrating on high-value tokens), Context Adherence (measuring grounding strength), and Falsifiability Score (detecting confident but contradictory claims). Evaluating on HaluBench (n=200, perfectly balanced), our theory-guided baseline achieves 0.8017 AUROC. BASE supervised models reach 0.8274 AUROC, while IMPROVED features boost performance to 0.8669 AUROC (4.95% gain), demonstrating consistent improvements across architectures. This competitive performance is achieved while using 75x less training data than Lynx (200 vs 15,000 samples), 1000x faster inference (5ms vs 5s), and remaining fully interpretable. Crucially, we report a negative result: the Rationalization signal fails to distinguish hallucinations, suggesting that LLMs generate coherent reasoning for false premises (“Sycophancy”). This work demonstrates that domain knowledge encoded in signal architecture provides superior data efficiency compared to scaling LLM judges, achieving strong performance with lightweight (less than 1M parameter), explainable models suitable for production deployment.
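Illustrative sketch: AUROC, the metric reported above, equals the probability that a randomly chosen positive example (here, a hallucination) receives a higher detector score than a randomly chosen negative one, with ties counted as half. The pairwise implementation below follows that definition directly. The scores and labels are toy values, not data from the paper, and the function name `auroc` is an assumption.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: detector outputs, higher means "more likely hallucinated".
    labels: 1 for hallucinated, 0 for faithful.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy balanced evaluation set, for illustration only.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the number of examples, which is fine for a set of a few hundred items such as the n=200 evaluation described here. Larger sets are usually handled with a rank-based formula instead.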
[220] Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats
Ee Wei Seah, Yongsen Zheng, Naga Nikshith, Mahran Morsidi, Gabriel Waikin Loh Matienzo, Nigel Gay, Akriti Vij, Benjamin Chua, En Qi Ng, Sharmini Johnson, Vanessa Wilfred, Wan Sie Lee, Anna Davidson, Catherine Devine, Erin Zorer, Gareth Holvey, Harry Coppock, James Walpole, Jerome Wynee, Magda Dubois, Michael Schmatz, Patrick Keane, Sam Deverett, Bill Black, Bo Yan, Bushra Sabir, Frank Sun, Hao Zhang, Harriet Farlow, Helen Zhou, Lingming Dong, Qinghua Lu, Seung Jang, Sharif Abuadbba, Simon O’Callaghan, Suyu Ma, Tom Howroyd, Cyrus Fung, Fatemeh Azadi, Isar Nejadgholi, Krishnapriya Vishnubhotla, Pulei Xiong, Saeedeh Lohrasbi, Scott Buffett, Shahrear Iqbal, Sowmya Vajjala, Anna Safont-Andreu, Luca Massarelli, Oskar van der Wal, Simon Möller, Agnes Delaborde, Joris Duguépéroux, Nicolas Rolin, Romane Gallienne, Sarah Behanzin, Tom Seimandi, Akiko Murakami, Takayuki Semitsu, Teresa Tsukiji, Angela Kinuthia, Michael Michie, Stephanie Kasaon, Jean Wangari, Hankyul Baek, Jaewon Noh, Kihyuk Nam, Sang Seo, Sungpil Shin, Taewhi Lee, Yongsu Kim
Main category: cs.AI
TL;DR: International collaboration develops best practices for testing AI agents across languages and cultures, focusing on methodological issues rather than model performance.
Details
Motivation: As autonomous AI agents are deployed globally, there's a need to ensure they handle different languages and cultures accurately and securely. Current agent testing is still developing, and reduced oversight of real-world interactions introduces new risks.
Method: International collaboration involving representatives from multiple countries split into two strands: (1) common risks (sensitive information leakage and fraud) led by Singapore AISI, and (2) cybersecurity led by UK AISI. Evaluated open and closed-weight models using public agentic benchmarks, focusing on methodological issues.
Result: This marks the third exercise building on previous collaborations, representing an important step forward in advancing the science of agentic evaluations. The focus was on understanding methodological issues in conducting tests rather than examining specific test results or model capabilities.
Conclusion: International collaboration is crucial for developing best practices in agentic AI testing as systems are deployed globally. The exercise advances the science of evaluations by addressing methodological challenges in testing across languages and cultures.
Abstract: The rapid rise of autonomous AI systems and advancements in agent capabilities are introducing new risks due to reduced oversight of real-world interactions. Yet agent testing remains nascent and is still a developing science. As AI agents begin to be deployed globally, it is important that they handle different languages and cultures accurately and securely. To address this, participants from The International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the European Commission, France, Kenya, South Korea, and the United Kingdom have come together to align approaches to agentic evaluations. This is the third exercise, building on insights from two earlier joint testing exercises conducted by the Network in November 2024 and February 2025. The objective is to further refine best practices for testing advanced AI systems. The exercise was split into two strands: (1) common risks, including leakage of sensitive information and fraud, led by Singapore AISI; and (2) cybersecurity, led by UK AISI. A mix of open and closed-weight models were evaluated against tasks from various public agentic benchmarks. Given the nascency of agentic testing, our primary focus was on understanding methodological issues in conducting such tests, rather than examining test results or model capabilities. This collaboration marks an important step forward as participants work together to advance the science of agentic evaluations.
[221] From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models
Jiaxin Zhang, Wendi Cui, Zhuohang Li, Lifu Huang, Bradley Malin, Caiming Xiong, Chien-Sheng Wu
Main category: cs.AI
TL;DR: Survey on using uncertainty as an active control signal in LLMs to improve reliability in high-stakes domains, covering reasoning, agents, and reinforcement learning applications.
Details
Motivation: LLMs show remarkable capabilities but their unreliability remains a critical barrier to deployment in high-stakes domains. There's a need to transform uncertainty from a passive diagnostic metric into an active control signal for real-time model behavior guidance.
Method: The survey analyzes how uncertainty is leveraged as an active control signal across three frontiers: 1) advanced reasoning (optimizing computation and triggering self-correction), 2) autonomous agents (governing metacognitive decisions about tool use and information seeking), and 3) reinforcement learning (mitigating reward hacking and enabling self-improvement via intrinsic rewards). Grounds these advancements in theoretical frameworks like Bayesian methods and Conformal Prediction.
Result: Provides a comprehensive overview, critical analysis, and practical design patterns for using uncertainty as an active control signal. Demonstrates a unified perspective on this transformative trend in AI reliability.
Conclusion: Mastering the new trend of uncertainty as an active control signal is essential for building the next generation of scalable, reliable, and trustworthy AI systems.
Abstract: While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional evolution in addressing this challenge: the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers: in advanced reasoning to optimize computation and trigger self-correction; in autonomous agents to govern metacognitive decisions about tool use and information seeking; and in reinforcement learning to mitigate reward hacking and enable self-improvement via intrinsic rewards. By grounding these advancements in emerging theoretical frameworks like Bayesian methods and Conformal Prediction, we provide a unified perspective on this transformative trend. This survey provides a comprehensive overview, critical analysis, and practical design patterns, arguing that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
[222] Agentic Uncertainty Quantification
Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu
Main category: cs.AI
TL;DR: Proposes Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active control signals to prevent error propagation in AI agents, balancing efficient execution with targeted deliberation.
Details
Motivation: AI agents suffer from "Spiral of Hallucination" where early epistemic errors propagate irreversibly. Existing methods are inadequate: uncertainty quantification only diagnoses risks without addressing them, while self-reflection suffers from continuous/aimless corrections.
Method: Dual-Process Agentic UQ framework with two mechanisms: System 1 (Uncertainty-Aware Memory) implicitly propagates verbalized confidence and semantic explanations to prevent blind decisions; System 2 (Uncertainty-Aware Reflection) uses explanations as rational cues to trigger targeted inference-time resolution only when necessary.
Result: Extensive experiments on closed-loop benchmarks and open-ended deep research tasks show superior performance and trajectory-level calibration with this training-free approach.
Conclusion: The principled AUQ framework represents a significant step towards reliable agents by dynamically balancing efficient execution and deep deliberation through active uncertainty management.
Abstract: Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the "Spiral of Hallucination," where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled framework AUQ represents a significant step towards reliable agents.
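The dual-process idea in this entry can be pictured with a minimal Python sketch. It is not the authors' implementation: the names `Step`, `UncertaintyAwareMemory`, `run_step`, the callables `propose` and `reflect`, and the fixed `threshold` of 0.7 are all hypothetical, and a simple threshold is only one assumed way to decide when reflection is "necessary".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    action: str
    confidence: float  # verbalized confidence in [0, 1] (assumed scale)
    explanation: str   # the agent's stated reason for its confidence


@dataclass
class UncertaintyAwareMemory:
    """Hypothetical System-1 memory: keeps confidence and explanation with each step."""
    steps: List[Step] = field(default_factory=list)

    def render(self) -> str:
        # Later steps see earlier confidence and explanations, not just actions.
        return "\n".join(
            f"{s.action} (confidence={s.confidence:.2f}; {s.explanation})"
            for s in self.steps
        )


def run_step(
    propose: Callable[[str], Step],
    reflect: Callable[[Step, str], Step],
    memory: UncertaintyAwareMemory,
    threshold: float = 0.7,  # assumed cut-off, not taken from the paper
) -> Step:
    """Propose a step; call the (costly) reflection only when confidence is low."""
    context = memory.render()
    step = propose(context)
    if step.confidence < threshold:
        # Hypothetical System-2 path: targeted resolution guided by the explanation.
        step = reflect(step, context)
    memory.steps.append(step)
    return step
```

In this sketch, `propose` and `reflect` would wrap LLM calls; confident steps pass straight through, so the extra cost is paid only on uncertain ones.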
[223] Improving Methodologies for LLM Evaluations Across Global Languages
Akriti Vij, Benjamin Chua, Darshini Ramiah, En Qi Ng, Mahran Morsidi, Naga Nikshith Gangarapu, Sharmini Johnson, Vanessa Wilfred, Vikneswaran Kumaran, Wan Sie Lee, Wenzhuo Yang, Yongsen Zheng, Bill Black, Boming Xia, Frank Sun, Hao Zhang, Qinghua Lu, Suyu Ma, Yue Liu, Chi-kiu Lo, Fatemeh Azadi, Isar Nejadgholi, Sowmya Vajjala, Agnes Delaborde, Nicolas Rolin, Tom Seimandi, Akiko Murakami, Haruto Ishi, Satoshi Sekine, Takayuki Semitsu, Tasuku Sasaki, Angela Kinuthia, Jean Wangari, Michael Michie, Stephanie Kasaon, Hankyul Baek, Jaewon Noh, Kihyuk Nam, Sang Seo, Sungpil Shin, Taewhi Lee, Yongsu Kim, Daisy Newbold-Harrop, Jessica Wang, Mahmoud Ghanem, Vy Hong
Main category: cs.AI
TL;DR: Multilingual safety evaluation of AI models across 10 languages reveals varying safety behaviors, differences in safeguard robustness, and methodological insights for improving cross-cultural AI safety testing.
Details
Motivation: As AI models are deployed globally, it's essential to ensure their safety and reliability across diverse linguistic and cultural contexts, requiring examination of how current model safeguards perform in multilingual settings.
Method: International collaboration tested two open-weight models across 10 languages (Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese, Telugu) using over 6,000 translated prompts across five harm categories. Evaluation used both LLM-as-a-judge and human annotation methods.
Result: Safety behaviors vary significantly across languages, including differences in safeguard robustness across languages and harm types, and variation in evaluator reliability between LLM-as-judge and human review.
Conclusion: The work provides methodological insights for improving multilingual safety evaluations (culturally contextualized translations, stress-tested evaluator prompts, clearer guidelines) and represents an initial step toward a shared framework for multilingual safety testing, calling for continued collaboration.
Abstract: As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high and low resourced groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages. These include differences in safeguard robustness across languages and harm types and variation in evaluator reliability (LLM-as-judge vs. human review). Further, it also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.
[224] AgentSM: Semantic Memory for Agentic Text-to-SQL
Asim Biswal, Chuan Lei, Xiao Qin, Aodong Li, Balakrishnan Narayanaswamy, Tim Kraska
Main category: cs.AI
TL;DR: AgentSM is an agentic framework for Text-to-SQL that uses interpretable semantic memory from execution traces to improve efficiency and accuracy on complex enterprise schemas.
Details
Motivation: Current LLM-based Text-to-SQL systems struggle with large enterprise schemas, diverse SQL dialects, and expensive multi-step reasoning. Agentic approaches show promise but suffer from inefficiency, instability, and inconsistent outputs.
Method: AgentSM builds interpretable semantic memory by capturing prior execution traces (or synthesizing curated ones) as structured programs that directly guide future reasoning, enabling systematic reuse of reasoning paths.
Result: AgentSM reduces average token usage by 25% and trajectory length by 35% on Spider 2.0 benchmark, and achieves state-of-the-art accuracy of 44.8% on Spider 2.0 Lite benchmark.
Conclusion: AgentSM’s semantic memory approach enables more efficient and reliable scaling to complex enterprise Text-to-SQL tasks by systematically reusing reasoning paths from execution traces.
Abstract: Recent advances in LLM-based Text-to-SQL have achieved remarkable gains on public benchmarks such as BIRD and Spider. Yet, these systems struggle to scale in realistic enterprise settings with large, complex schemas, diverse SQL dialects, and expensive multi-step reasoning. Emerging agentic approaches show potential for adaptive reasoning but often suffer from inefficiency and instability: repeating interactions with databases, producing inconsistent outputs, and occasionally failing to generate valid answers. To address these challenges, we introduce Agent Semantic Memory (AgentSM), an agentic framework for Text-to-SQL that builds and leverages interpretable semantic memory. Instead of relying on raw scratchpads or vector retrieval, AgentSM captures prior execution traces (or synthesizes curated ones) as structured programs that directly guide future reasoning. This design enables systematic reuse of reasoning paths, which allows agents to scale to larger schemas, more complex questions, and longer trajectories efficiently and reliably. Compared to state-of-the-art systems, AgentSM achieves higher efficiency by reducing average token usage and trajectory length by 25% and 35%, respectively, on the Spider 2.0 benchmark. It also improves execution accuracy, reaching a state-of-the-art accuracy of 44.8% on the Spider 2.0 Lite benchmark.
[225] Investigation of the Generalisation Ability of Genetic Programming-evolved Scheduling Rules in Dynamic Flexible Job Shop Scheduling
Luyao Zhu, Fangfang Zhang, Yi Mei, Mengjie Zhang
Main category: cs.AI
TL;DR: GP-evolved scheduling rules for DFJSS show good cross-type generalization when training instances have more jobs than test instances with fixed machines, or when scales/parameters are similar. Decision point distribution is key to generalization performance.
Details
Motivation: Existing GP studies for DFJSS typically train and test on instances that differ only by random seeds, leaving cross-type generalization ability unexplored. This gap needs addressing to understand how GP-evolved rules perform on structurally different DFJSS instances.
Method: Systematic investigation of GP-evolved scheduling rules’ generalization across multiple dimensions: problem scale (machines/jobs), key job shop parameters (utilization level), and data distributions. Analysis of how these factors influence performance on unseen instance types.
Result: Good generalization occurs when training instances have more jobs than test instances with fixed machines, and when training/test instances have similar scales or parameters. Decision point number and distribution crucially explain performance differences - similar distributions lead to better generalization.
Conclusion: The study provides new insights into GP generalization in DFJSS and highlights the need for evolving more generalizable rules capable of handling heterogeneous DFJSS instances effectively. Decision point distribution is a key factor in generalization performance.
Abstract: Dynamic Flexible Job Shop Scheduling (DFJSS) is a complex combinatorial optimisation problem that requires simultaneous machine assignment and operation sequencing decisions in dynamic production environments. Genetic Programming (GP) has been widely applied to automatically evolve scheduling rules for DFJSS. However, existing studies typically train and test GP-evolved rules on DFJSS instances of the same type, which differ only by random seeds rather than by structural characteristics, leaving their cross-type generalisation ability largely unexplored. To address this gap, this paper systematically investigates the generalisation ability of GP-evolved scheduling rules under diverse DFJSS conditions. A series of experiments are conducted across multiple dimensions, including problem scale (i.e., the number of machines and jobs), key job shop parameters (e.g., utilisation level), and data distributions, to analyse how these factors influence GP performance on unseen instance types. The results show that good generalisation occurs when the training instances contain more jobs than the test instances while keeping the number of machines fixed, and when both training and test instances have similar scales or job shop parameters. Further analysis reveals that the number and distribution of decision points in DFJSS instances play a crucial role in explaining these performance differences. Similar decision point distributions lead to better generalisation, whereas significant discrepancies result in a marked degradation of performance. Overall, this study provides new insights into the generalisation ability of GP in DFJSS and highlights the necessity of evolving more generalisable GP rules capable of handling heterogeneous DFJSS instances effectively.
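For readers unfamiliar with GP-evolved scheduling rules, a rule is typically an arithmetic expression tree that scores each candidate at a decision point. The Python sketch below illustrates only that general idea and is not taken from the paper: the terminal names `PT` (processing time) and `WIQ` (work in queue), the example tree, and the lowest-value-wins convention are assumptions.

```python
import operator
from typing import Dict, Tuple, Union

# A tree is a terminal name, a numeric constant, or an (op, left, right) tuple.
Tree = Union[str, float, Tuple]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree: Tree, features: Dict[str, float]) -> float:
    """Recursively evaluate an expression tree on one candidate's features."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, features), evaluate(right, features))
    if isinstance(tree, str):
        return features[tree]  # terminal: look up the feature value
    return float(tree)         # numeric constant


def choose_operation(rule: Tree, candidates: Dict[str, Dict[str, float]]) -> str:
    """At a decision point, pick the candidate with the lowest priority value."""
    return min(candidates, key=lambda name: evaluate(rule, candidates[name]))


# Hypothetical rule: PT + 0.5 * WIQ
rule = ("+", "PT", ("*", 0.5, "WIQ"))
candidates = {
    "op1": {"PT": 4.0, "WIQ": 10.0},
    "op2": {"PT": 6.0, "WIQ": 2.0},
}
best = choose_operation(rule, candidates)
```

Because the rule is applied once per decision point, the number and distribution of such points in an instance determine which regions of the feature space the rule is actually exercised on.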
[226] Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity
Hangle Hu, Chenyu Hou, Bin Cao, Ruizhe Li
Main category: cs.AI
TL;DR: BIRD-Python benchmark addresses Text-to-Python reliability gap vs SQL, showing Python can match SQL performance when domain knowledge gaps are filled via Logic Completion Framework.
Details
Motivation: Real-world analytics increasingly need Python/Pandas for file-based data and complex workflows, but Text-to-Python reliability remains underexplored compared to mature SQL ecosystem.
Method: Created BIRD-Python benchmark by refining original dataset to reduce noise and align execution semantics. Proposed Logic Completion Framework (LCF) to resolve ambiguity by incorporating latent domain knowledge into generation process.
Result: Performance differences stem from missing domain context, not inherent code generation limitations. When domain gaps are addressed, Text-to-Python achieves performance parity with Text-to-SQL.
Conclusion: Python is viable for analytical agents if systems effectively ground ambiguous natural language inputs in executable logical specifications.
Abstract: While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages such as Python or Pandas to manage file-based data and complex analytical workflows. Despite this growing need, the reliability of Text-to-Python in core data retrieval remains underexplored relative to the mature SQL ecosystem. To address this gap, we introduce BIRD-Python, a benchmark designed for cross-paradigm evaluation. We systematically refined the original dataset to reduce annotation noise and align execution semantics, thereby establishing a consistent and standardized baseline for comparison. Our analysis reveals a fundamental paradigmatic divergence: whereas SQL leverages implicit DBMS behaviors through its declarative structure, Python requires explicit procedural logic, making it highly sensitive to underspecified user intent. To mitigate this challenge, we propose the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge into the generation process. Experimental results show that (1) performance differences primarily stem from missing domain context rather than inherent limitations in code generation, and (2) when these gaps are addressed, Text-to-Python achieves performance parity with Text-to-SQL. These findings establish Python as a viable foundation for analytical agents, provided that systems effectively ground ambiguous natural language inputs in executable logical specifications. Resources are available at https://anonymous.4open.science/r/Bird-Python-43B7/.
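The contrast between implicit DBMS behavior and explicit procedural logic can be seen in a small example. The Python sketch below is illustrative only and does not come from the benchmark: the `sales` table and its values are made up, and NULL handling in `AVG` is just one instance of implicit SQL behavior.

```python
import sqlite3

# Toy data with one missing amount (hypothetical, not from BIRD-Python).
rows = [("a", 10.0), ("b", None), ("c", 20.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# SQL: AVG skips NULL values implicitly; the query never mentions them.
sql_avg = conn.execute("SELECT AVG(amount) FROM sales").fetchone()[0]
conn.close()

# Plain Python: the same intent has to be spelled out by the generated code.
amounts = [amount for _, amount in rows if amount is not None]
py_avg = sum(amounts) / len(amounts)

# A different but equally plausible reading of "average amount":
# treat missing values as zero. The question alone does not say which is meant.
zero_filled_avg = sum(amount or 0.0 for _, amount in rows) / len(rows)
```

Here `sql_avg` and `py_avg` agree, while `zero_filled_avg` differs, which is the kind of underspecified intent that procedural code must resolve explicitly.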
[227] PhysProver: Advancing Automatic Theorem Proving for Physics
Hanning Zhang, Ruida Wang, Rui Pan, Wenyuan Wang, Bingxu Meng, Tong Zhang
Main category: cs.AI
TL;DR: First approach to enhance formal theorem proving in physics domain using RLVR training on PhysLeanData, achieving 2.4% improvement in physics domains and 1.3% gains on math benchmark.
Details
Motivation: While verifiable languages and LLMs have advanced mathematical theorem proving, formal physics reasoning has been neglected despite relying on similar problem-solving frameworks. There's a need to extend formal theorem proving beyond mathematics into physics domains.
Method: Created PhysLeanData dataset from PhysLean theorems and conjecture-based formal data generation. Used DeepSeek-Prover-V2-7B as base model and applied Reinforcement Learning with Verifiable Rewards (RLVR) to train PhysProver model.
Result: With only ~5K training samples, PhysProver achieved 2.4% overall improvement across multiple physics sub-domains. Also showed 1.3% gains on MiniF2F-Test benchmark, demonstrating generalization beyond physics to formal math capabilities.
Conclusion: The approach effectively extends formal theorem proving to physics domains, showing both domain-specific improvements and cross-domain generalization. Provides a paradigm for expanding formal provers beyond mathematics, with dataset and model to be released.
Abstract: The combination of verifiable languages and LLMs has significantly influenced both the mathematical and computer science communities because it provides a rigorous foundation for theorem proving. Recent advancements in the field provide foundation models and sophisticated agentic systems pushing the boundaries of formal mathematical reasoning to approach the natural language capability of LLMs. However, little attention has been given to formal physics reasoning, which also relies heavily on similar problem-solving and theorem-proving frameworks. To solve this problem, this paper presents, to the best of our knowledge, the first approach to enhance formal theorem proving in the physics domain. We compose a dedicated dataset, PhysLeanData, for the task. It is composed of theorems sampled from PhysLean and data generated by a conjecture-based formal data generation pipeline. In the training pipeline, we leverage DeepSeek-Prover-V2-7B, a strong open-source mathematical theorem prover, and apply Reinforcement Learning with Verifiable Rewards (RLVR) to train our model PhysProver. Comprehensive experiments demonstrate that, using only ~5K training samples, PhysProver achieves an overall 2.4% improvement in multiple sub-domains. Furthermore, after formal physics training, we observe 1.3% gains on the MiniF2F-Test benchmark, which indicates non-trivial generalization beyond physics domains and enhancement for formal math capability as well. The results highlight the effectiveness and efficiency of our approach, which provides a paradigm for extending formal provers outside mathematical domains. To foster further research, we will release both our dataset and model to the community.
[228] Tabular Incremental Inference
Xinda Chen, Xing Zhen, Hanyu Zhang, Weimin Tan, Bo Yan
Main category: cs.AI
TL;DR: Tabular Incremental Inference (TabII) enables AI models to handle dynamically changing table columns during inference, addressing limitations of fixed-column training approaches.
Details
Motivation: Traditional AI models trained on tables with fixed columns cannot handle dynamically changing tables due to technological advancements, changing needs, and data integration. A new approach is needed for unsupervised handling of such dynamic tables.
Method: Frames TabII as an optimization problem based on information bottleneck theory. Uses Large Language Model placeholders and Pretrained TabAdapter for external knowledge, with Incremental Sample Condensation blocks to condense task-relevant information from incremental column attributes.
Result: Experimental results across eight public datasets show TabII effectively utilizes incremental attributes and achieves state-of-the-art performance.
Conclusion: TabII successfully addresses the challenge of dynamically changing tables by enabling models to incorporate new columns during inference, enhancing AI model practicality in real-world scenarios with evolving data structures.
Abstract: Tabular data is a fundamental form of data structure. The evolution of table analysis tools reflects humanity’s continuous progress in data acquisition, management, and processing. The dynamic changes in table columns arise from technological advancements, changing needs, data integration, etc. However, the standard process of training AI models on tables with fixed columns and then performing inference is not suitable for handling dynamically changed tables. Therefore, new methods are needed for efficiently handling such tables in an unsupervised manner. In this paper, we introduce a new task, Tabular Incremental Inference (TabII), which aims to enable trained models to incorporate new columns during the inference stage, enhancing the practicality of AI models in scenarios where tables are dynamically changed. Furthermore, we demonstrate that this new task can be framed as an optimization problem based on the information bottleneck theory, which emphasizes that the key to an ideal tabular incremental inference approach lies in minimizing mutual information between tabular data and representation while maximizing between representation and task labels. Under this guidance, we design a TabII method with Large Language Model placeholders and Pretrained TabAdapter to provide external knowledge and Incremental Sample Condensation blocks to condense the task-relevant information given by incremental column attributes. Experimental results across eight public datasets show that TabII effectively utilizes incremental attributes, achieving state-of-the-art performance.
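The information bottleneck trade-off described in this entry is usually written as a single Lagrangian. The form below is the standard textbook objective, given here as an assumption about what the abstract refers to; the symbols $X$ (tabular input), $Z$ (learned representation), $Y$ (task labels), and the trade-off weight $\beta$ are notation introduced for illustration, and the paper's exact formulation may differ.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

Minimizing $I(X; Z)$ compresses the table into the representation, while the $-\beta\, I(Z; Y)$ term rewards keeping whatever is predictive of the labels.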
[229] Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning
Xiefeng Wu, Mingyu Hu, Shu Zhang
Main category: cs.AI
TL;DR: SigEnt-SAC: An off-policy actor-critic RL method that learns from scratch using just one expert trajectory, featuring a sigmoid-bounded entropy term to prevent negative entropy issues and reduce Q-function oscillations.
Details
Motivation: Real-world RL deployment faces challenges: sample inefficiency, sparse rewards, and noisy visual observations. Existing methods require large datasets (offline-to-online) or extensive pretraining (VLA-assisted RL), lacking a low-cost solution with minimal data requirements.
Method: SigEnt-SAC introduces a sigmoid-bounded entropy term in an off-policy actor-critic framework. This prevents optimization toward out-of-distribution actions driven by negative entropy and reduces Q-function oscillations. The method learns from scratch using only a single expert trajectory.
Result: On D4RL benchmarks, SigEnt-SAC substantially alleviates Q-function oscillations and reaches 100% success rate faster than prior methods. In real-world robotic tasks across multiple embodiments with raw images and sparse rewards, it learns successful policies with minimal real-world interactions.
Conclusion: SigEnt-SAC provides a low-cost, practical pathway for real-world RL deployment by enabling learning from minimal data (single expert trajectory) while maintaining stability and efficiency through sigmoid-bounded entropy regularization.
Abstract: Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods need large datasets and can be unstable, while VLA-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce SigEnt-SAC, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.
[230] Agentic Confidence Calibration
Jiaxin Zhang, Caiming Xiong, Chien-Sheng Wu
Main category: cs.AI
TL;DR: HTC is a novel framework for calibrating AI agent confidence by analyzing entire task trajectories, outperforming existing methods and providing interpretability, transferability, and generalization.
Details
Motivation: AI agents are becoming autonomous but suffer from overconfidence in failures, especially in high-stakes settings. Existing calibration methods designed for static single-turn outputs cannot handle agent-specific challenges like compounding errors, tool uncertainty, and opaque failure modes in multi-step trajectories.
Method: Proposes Holistic Trajectory Calibration (HTC), a diagnostic framework that extracts rich process-level features from an agent’s entire trajectory, ranging from macro dynamics to micro stability. Uses a simple, interpretable model to analyze these features for confidence calibration.
Result: HTC consistently outperforms strong baselines in both calibration and discrimination across eight benchmarks, multiple LLMs, and diverse agent frameworks. Achieves best calibration (lowest ECE) on out-of-domain GAIA benchmark via General Agent Calibrator (GAC).
Conclusion: HTC establishes a new process-centric paradigm for confidence calibration, providing a framework for diagnosing and enhancing AI agent reliability with interpretability, transferability, and generalization capabilities.
Abstract: AI agents are rapidly advancing from passive language models to autonomous systems executing complex, multi-step tasks. Yet their overconfidence in failure remains a fundamental barrier to deployment in high-stakes settings. Existing calibration methods, built for static single-turn outputs, cannot address the unique challenges of agentic systems, such as compounding errors along trajectories, uncertainty from external tools, and opaque failure modes. To address these challenges, we introduce, for the first time, the problem of Agentic Confidence Calibration and propose Holistic Trajectory Calibration (HTC), a novel diagnostic framework that extracts rich process-level features ranging from macro dynamics to micro stability across an agent’s entire trajectory. Powered by a simple, interpretable model, HTC consistently surpasses strong baselines in both calibration and discrimination, across eight benchmarks, multiple LLMs, and diverse agent frameworks. Beyond performance, HTC delivers three essential advances: it provides interpretability by revealing the signals behind failure, enables transferability by applying across domains without retraining, and achieves generalization through a General Agent Calibrator (GAC) that achieves the best calibration (lowest ECE) on the out-of-domain GAIA benchmark. Together, these contributions establish a new process-centric paradigm for confidence calibration, providing a framework for diagnosing and enhancing the reliability of AI agents.
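The calibration metric named in this entry, expected calibration error (ECE), can be computed as in the Python sketch below. This is the common equal-width-bin definition, not code from the paper; the function name, the choice of 10 bins, and the input format (one confidence and one 0/1 success flag per trajectory) are assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one trajectory")

    # bins[i] collects (confidence, outcome) pairs whose confidence falls in bin i.
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, outcomes):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Clamp the index so that conf == 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

An agent that reports 0.9 confidence on every task but succeeds on only a quarter of them would score an ECE of about 0.65 under this definition, which is the overconfidence-in-failure pattern the entry describes.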
[231] Creativity in the Age of AI: Rethinking the Role of Intentional Agency
James S. Pearson, Matthew J. Dennis, Marc Cheong
Main category: cs.AI
TL;DR: The paper argues against requiring intentional agency for creativity, showing how generative AI challenges this traditional view and proposing a consistency requirement instead.
Details
Motivation: The traditional Intentional Agency Condition (IAC) for creativity is increasingly problematic due to advances in generative AI, which can produce creative outputs without intentional agency, creating tension between theory and practice.
Method: 1) Corpus analysis showing increased ascriptions of creativity to AI despite lack of intentional agency; 2) Conceptual engineering approach to analyze the social function of creativity concepts; 3) Proposing a consistency requirement as alternative to IAC.
Result: The IAC should be rejected as a general condition for creativity because: 1) Linguistic evidence shows people increasingly attribute creativity to AI; 2) IAC no longer serves its core social function and instead biases assessments of AI outputs.
Conclusion: Creativity should be defined by reliable generation of novel and valuable products (consistency requirement), not intentional agency, though IAC may still be relevant in specific local domains where intentionality matters.
Abstract: Many theorists of creativity maintain that intentional agency is a necessary condition of creativity. We argue that this requirement, which we call the Intentional Agency Condition (IAC), should be rejected as a general condition of creativity, while retaining its relevance in specific contexts. We show that recent advances in generative AI have rendered the IAC increasingly problematic, both descriptively and functionally. We offer two reasons for abandoning it at the general level. First, we present corpus evidence indicating that authors and journalists are increasingly comfortable ascribing creativity to generative AI, despite its lack of intentional agency. This development places pressure on the linguistic intuitions that have traditionally been taken to support the IAC. Second, drawing on the method of conceptual engineering, we argue that the IAC no longer fulfils its core social function. Rather than facilitating the identification and encouragement of reliable sources of novel and valuable products, it now feeds into biases that distort our assessments of AI-generated outputs. We therefore propose replacing the IAC with a consistency requirement, according to which creativity tracks the reliable generation of novel and valuable products. Nonetheless, we explain why the IAC should be retained in specific local domains.
[232] VitalDiagnosis: AI-Driven Ecosystem for 24/7 Vital Monitoring and Chronic Disease Management
Zhikai Xue, Tianqianjin Lin, Pengwei Yan, Ruichun Wang, Yuxin Liu, Zhuoren Jiang, Xiaozhong Liu
Main category: cs.AI
TL;DR: VitalDiagnosis is an LLM-driven ecosystem that transforms chronic disease management from passive monitoring to proactive, interactive engagement by integrating wearable device data with LLM reasoning capabilities.
Details
Motivation: Chronic diseases are the leading cause of death worldwide, exacerbated by strained medical resources and aging populations. Patients struggle with interpreting early deterioration signs and maintaining care plan adherence, creating a need for more proactive management solutions.
Method: The system integrates continuous data from wearable devices with LLM reasoning capabilities. It analyzes health triggers through context-aware inquiries, produces provisional insights within a collaborative patient-clinician workflow, and offers personalized guidance.
Result: The approach enables proactive and cooperative care by addressing both acute health anomalies and routine adherence issues, potentially enhancing patient self-management while reducing avoidable clinical workload.
Conclusion: VitalDiagnosis represents a paradigm shift in chronic disease management toward proactive, interactive engagement, leveraging LLM capabilities to create a more effective patient-clinician collaborative workflow.
Abstract: Chronic diseases have become the leading cause of death worldwide, a challenge intensified by strained medical resources and an aging population. Individually, patients often struggle to interpret early signs of deterioration or maintain adherence to care plans. In this paper, we introduce VitalDiagnosis, an LLM-driven ecosystem designed to shift chronic disease management from passive monitoring to proactive, interactive engagement. By integrating continuous data from wearable devices with the reasoning capabilities of LLMs, the system addresses both acute health anomalies and routine adherence. It analyzes triggers through context-aware inquiries, produces provisional insights within a collaborative patient-clinician workflow, and offers personalized guidance. This approach aims to promote a more proactive and cooperative care paradigm, with the potential to enhance patient self-management and reduce avoidable clinical workload.
[233] Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu
Main category: cs.AI
TL;DR: DeepVerifier: A rubrics-based verification system that enables Deep Research Agents to self-evolve at inference time by evaluating their own outputs against a failure taxonomy, achieving 8-11% accuracy gains without additional training.
Details
Motivation: Most existing Deep Research Agent (DRA) work focuses on post-training policy enhancement, but there's a need for alternative approaches that enable agents to self-improve during inference without additional training.
Method: Proposes DeepVerifier, a rubrics-based outcome reward verifier that uses an automatically constructed DRA Failure Taxonomy (5 major categories, 13 sub-categories) to evaluate agent outputs. The system enables inference-time scaling where agents self-improve by verifying their own answers and using the feedback for iterative refinement.
Result: DeepVerifier outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. Achieves 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when using capable closed-source LLMs. Also releases DeepVerifier-4K dataset with 4,646 high-quality agent steps for open-source development.
Conclusion: The rubrics-based verification approach enables practical self-evolution of Deep Research Agents at test time without additional training, offering a promising alternative to post-training methods while supporting both closed-source and open-source advancement through released datasets.
Abstract: Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent’s ability by iteratively verifying the policy model’s outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
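The verify-then-refine loop described in this abstract can be sketched in a few lines. This is a minimal illustration, not DeepVerifier itself: `generate`, `verify`, the toy stand-ins, and the `max_rounds` cap are all hypothetical names chosen for the example, and the real system would call an agent and a rubric-based verifier in their place.

```python
from typing import Callable, List, Tuple


def refine_with_verifier(
    task: str,
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run a test-time verify/refine loop.

    Returns the final answer and the number of verification rounds used.
    If no answer is accepted within max_rounds, the last (unverified)
    refinement is returned.
    """
    feedback_log: List[str] = []
    answer = generate(task, feedback_log)
    for round_idx in range(1, max_rounds + 1):
        accepted, feedback = verify(task, answer)
        if accepted:
            return answer, round_idx
        # Feed the rubric-style feedback back to the agent and retry.
        feedback_log.append(feedback)
        answer = generate(task, feedback_log)
    return answer, max_rounds


# Toy stand-ins (assumptions for illustration only).
def toy_generate(task: str, feedback_log: List[str]) -> str:
    # Pretend the agent corrects itself once it has seen any feedback.
    return "42" if feedback_log else "41"


def toy_verify(task: str, answer: str) -> Tuple[bool, str]:
    if answer == "42":
        return True, ""
    return False, "rubric: final answer is not supported by the cited evidence"


result = refine_with_verifier("toy task", toy_generate, toy_verify)
```

In this toy run the first answer is rejected, the feedback is logged, and the second answer passes, so no model parameters change between rounds.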
[234] ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen
Main category: cs.AI
TL;DR: ErrorMap is a method to identify why LLMs fail, not just when, by extracting failure signatures and creating ErrorAtlas taxonomy to reveal recurring error patterns across models and datasets.
Details
Motivation: Current LLM benchmarks only show when models fail but not why, making them incomplete for guiding model improvement. Wrong answers could stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning.
Method: ErrorMap extracts a model’s unique “failure signature” by analyzing errors across datasets. It works on any model or dataset with the same logic, and was applied to 35 datasets and 83 models to generate ErrorAtlas, a taxonomy of model errors.
Result: Created ErrorAtlas taxonomy revealing recurring failure patterns, highlighting underexplored error types like omissions of required details and question misinterpretation. The approach enables advanced evaluation that exposes hidden weaknesses.
Conclusion: ErrorMap and ErrorAtlas shift focus from where models succeed to why they fail, offering deeper evaluation that can be applied globally across models and tasks, providing richer insights into model behavior and limitations.
Abstract: Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model’s unique “failure signature”, clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation, one that exposes hidden weaknesses and directs progress. Unlike success, typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available with plans to periodically update ErrorAtlas as new benchmarks and models emerge.
[235] EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, Xipeng Qiu
Main category: cs.AI
TL;DR: EvoCUA introduces an evolutionary learning framework for computer-use agents that cycles between data generation and policy optimization, achieving state-of-the-art performance on OSWorld benchmark.
Details
Motivation: Current computer-use agents are limited by static data scaling and passive imitation learning, which fails to capture the complex causal dynamics of long-horizon computer tasks.
Method: EvoCUA integrates data generation and policy optimization in an evolutionary cycle: 1) verifiable synthesis engine for diverse task generation with executable validators, 2) scalable infrastructure for tens of thousands of asynchronous sandbox rollouts, 3) iterative evolving learning strategy that identifies capability boundaries and transforms failures into supervision through error analysis.
Result: Achieves 56.7% success rate on OSWorld benchmark, establishing new open-source SOTA, outperforming previous best open-source model OpenCUA-72B (45.0%) and closed-weights model UI-TARS-2 (53.1%). Shows consistent gains across foundation models of varying scales.
Conclusion: The evolutionary paradigm driven by learning from experience provides a robust and scalable path for advancing native agent capabilities, demonstrating generalizability across different model scales.
Abstract: The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work, we introduce EvoCUA, a native computer use agentic model. Unlike static imitation, EvoCUA integrates data generation and policy optimization into a self-sustaining evolutionary cycle. To mitigate data scarcity, we develop a verifiable synthesis engine that autonomously generates diverse tasks coupled with executable validators. To enable large-scale experience acquisition, we design a scalable infrastructure orchestrating tens of thousands of asynchronous sandbox rollouts. Building on these massive trajectories, we propose an iterative evolving learning strategy to efficiently internalize this experience. This mechanism dynamically regulates policy updates by identifying capability boundaries – reinforcing successful routines while transforming failure trajectories into rich supervision through error analysis and self-correction. Empirical evaluations on the OSWorld benchmark demonstrate that EvoCUA achieves a success rate of 56.7%, establishing a new open-source state-of-the-art. Notably, EvoCUA significantly outperforms the previous best open-source model, OpenCUA-72B (45.0%), and surpasses leading closed-weights models such as UI-TARS-2 (53.1%). Crucially, our results underscore the generalizability of this approach: the evolving paradigm driven by learning from experience yields consistent performance gains across foundation models of varying scales, establishing a robust and scalable path for advancing native agent capabilities.
[236] ICON: Invariant Counterfactual Optimization with Neuro-Symbolic Priors for Text-Based Person Search
Xiangyu Wang, Zhixin Lv, Yongjiao Sun, Anrui Han, Ye Yuan, Hangxu Ji
Main category: cs.AI
TL;DR: ICON is a framework for Text-Based Person Search that uses causal and topological priors to achieve geometric invariance and environmental independence, shifting from statistical co-occurrence learning to causal invariance.
Details
Motivation: Current TBPS models using pre-training fail in complex open-world scenarios due to "Passive Observation" causing spurious correlations and spatial semantic misalignment, lacking robustness against distribution shifts.
Method: Four components: 1) Rule-Guided Spatial Intervention to penalize bounding box noise sensitivity, 2) Counterfactual Context Disentanglement via background transplantation, 3) Saliency-Driven Semantic Regularization with adaptive masking, 4) Neuro-Symbolic Topological Alignment for feature matching consistency.
Result: ICON maintains leading performance on standard benchmarks and exhibits exceptional robustness against occlusion, background interference, and localization noise.
Conclusion: ICON advances TBPS by shifting from fitting statistical co-occurrences to learning causal invariance, effectively addressing fundamental defects in current paradigms.
Abstract: Text-Based Person Search (TBPS) holds unique value in real-world surveillance bridging visual perception and language understanding, yet current paradigms utilizing pre-training models often fail to transfer effectively to complex open-world scenarios. The reliance on “Passive Observation” leads to multifaceted spurious correlations and spatial semantic misalignment, causing a lack of robustness against distribution shifts. To fundamentally resolve these defects, this paper proposes ICON (Invariant Counterfactual Optimization with Neuro-symbolic priors), a framework integrating causal and topological priors. First, we introduce Rule-Guided Spatial Intervention to strictly penalize sensitivity to bounding box noise, forcibly severing location shortcuts to achieve geometric invariance. Second, Counterfactual Context Disentanglement is implemented via semantic-driven background transplantation, compelling the model to ignore background interference for environmental independence. Then, we employ Saliency-Driven Semantic Regularization with adaptive masking to resolve local saliency bias and guarantee holistic completeness. Finally, Neuro-Symbolic Topological Alignment utilizes neuro-symbolic priors to constrain feature matching, ensuring activated regions are topologically consistent with human structural logic. Experimental results demonstrate that ICON not only maintains leading performance on standard benchmarks but also exhibits exceptional robustness against occlusion, background interference, and localization noise. This approach effectively advances the field by shifting from fitting statistical co-occurrences to learning causal invariance.
[237] Natural Language-Driven Global Mapping of Martian Landforms
Yiran Wang, Shuoyuan Wang, Zhaoran Wei, Jiannan Zhao, Zhonghua Yao, Zejian Xie, Songxin Zhang, Jun Huang, Bingyi Jing, Hongxin Wei
Main category: cs.AI
TL;DR: MarScope is a planetary-scale vision-language framework that enables natural language-driven, label-free mapping of Martian landforms by aligning planetary images and text in a shared semantic space.
Details
Motivation: There's a mismatch between how planetary surfaces are analyzed (using high-level semantic concepts in natural language) and how orbital image archives are organized (at the pixel level), which limits scalable, open-ended exploration of planetary surfaces.
Method: MarScope aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs, enabling natural language-driven, label-free mapping of Martian landforms.
Result: The framework enables arbitrary user queries across the entire planet in 5 seconds with F1 scores up to 0.978, transforming global geomorphic mapping by replacing pre-defined classifications with flexible semantic retrieval.
Conclusion: MarScope establishes a new paradigm where natural language serves as a direct interface for scientific discovery over massive geospatial datasets, extending beyond morphological classification to facilitate process-oriented analysis and similarity-based geomorphological mapping at a planetary scale.
Abstract: Planetary surfaces are typically analyzed using high-level semantic concepts in natural language, yet vast orbital image archives remain organized at the pixel level. This mismatch limits scalable, open-ended exploration of planetary surfaces. Here we present MarScope, a planetary-scale vision-language framework enabling natural language-driven, label-free mapping of Martian landforms. MarScope aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs. This framework transforms global geomorphic mapping on Mars by replacing pre-defined classifications with flexible semantic retrieval, enabling arbitrary user queries across the entire planet in 5 seconds with F1 scores up to 0.978. Applications further show that it extends beyond morphological classification to facilitate process-oriented analysis and similarity-based geomorphological mapping at a planetary scale. MarScope establishes a new paradigm where natural language serves as a direct interface for scientific discovery over massive geospatial datasets.
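Retrieval in a shared image-text embedding space usually reduces to ranking image embeddings by similarity to a query embedding. The sketch below illustrates that idea with cosine similarity over hand-written 3-dimensional vectors. The tile names and vectors are made up for the example, and the abstract does not say which similarity measure or index MarScope uses, so treat both as assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query_vec: Sequence[float],
    tile_vecs: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank image tiles by similarity to the text query embedding."""
    scored = [(tile_id, cosine(query_vec, vec)) for tile_id, vec in tile_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; a real system would obtain these from the
# image and text encoders of the vision-language model.
tiles = {
    "tile_a": [0.9, 0.1, 0.0],
    "tile_b": [0.1, 0.9, 0.0],
    "tile_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]  # stand-in for an embedded text query
top = retrieve(query, tiles, top_k=2)
```

Because no class labels enter the ranking, any text query that can be embedded can be mapped, which is what makes the approach label-free.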
[238] Decoupling Return-to-Go for Efficient Decision Transformer
Yongyi Wang, Hanyu Liu, Lingfeng Li, Bozhou Chen, Ang Li, Qirui Zheng, Xionghui Yang, Wenxin Li
Main category: cs.AI
TL;DR: DDT simplifies Decision Transformer by removing redundant RTG sequence input, using only latest RTG for action guidance, improving performance and efficiency.
Details
Motivation: The authors identify a critical redundancy in Decision Transformer's design where feeding the entire sequence of Return-to-Go (RTG) values is theoretically unnecessary, as only the most recent RTG affects action prediction. This redundancy can impair DT's performance.
Method: Propose Decoupled DT (DDT) which simplifies the architecture by processing only observation and action sequences through the Transformer, while using only the latest RTG to guide action prediction, eliminating the redundant RTG sequence input.
Result: DDT significantly outperforms original DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks, while also reducing computational cost.
Conclusion: The redundancy in DT’s RTG sequence input impairs performance, and DDT’s streamlined approach of using only the latest RTG improves both performance and efficiency, demonstrating the importance of architectural simplicity in sequence modeling for offline RL.
Abstract: The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT’s performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
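The difference in input layout can be made concrete with a small sketch. The original DT interleaves one RTG, one observation, and one action token per step. The decoupled layout below keeps only observation and action tokens and hands the latest RTG over separately. How DDT injects that RTG into the action head is not stated in the abstract, so the function only returns it alongside the tokens. All function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, float]


def dt_tokens(rtgs: List[float], observations: List[float], actions: List[float]) -> List[Token]:
    """Original DT layout: (RTG, obs, action) per step.

    `actions` is one element shorter than `observations` because the
    action for the final step is the one to be predicted.
    """
    tokens: List[Token] = []
    for t, (rtg, obs) in enumerate(zip(rtgs, observations)):
        tokens += [("rtg", rtg), ("obs", obs)]
        if t < len(actions):
            tokens.append(("act", actions[t]))
    return tokens


def ddt_inputs(rtgs: List[float], observations: List[float], actions: List[float]) -> Tuple[List[Token], float]:
    """Decoupled layout: only obs/action tokens, plus the latest RTG."""
    tokens: List[Token] = []
    for t, obs in enumerate(observations):
        tokens.append(("obs", obs))
        if t < len(actions):
            tokens.append(("act", actions[t]))
    return tokens, rtgs[-1]


rtgs = [10.0, 9.0, 7.5, 7.0]
observations = [0.1, 0.2, 0.3, 0.4]
actions = [1.0, 0.0, 1.0]  # action for the last step is still unknown

dt_seq = dt_tokens(rtgs, observations, actions)
ddt_seq, latest_rtg = ddt_inputs(rtgs, observations, actions)
```

For a context of K steps the sequence shrinks from 3K - 1 to 2K - 1 tokens, which is one plausible source of the reduced computational cost, since self-attention cost grows with sequence length.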
[239] Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLMs for Live Streaming Risk Assessment
Yiran Qiao, Xiang Ao, Jing Chen, Yang Liu, Qiwei Zhong, Qing He
Main category: cs.AI
TL;DR: CS-VAR is a novel framework for live streaming risk detection that combines a lightweight domain-specific model with LLM-guided training using cross-session behavioral evidence, enabling efficient real-time detection of recurring malicious patterns across streams.
Details
Motivation: Live streaming platforms face complex risks like scams and coordinated malicious behaviors that accumulate gradually and recur across seemingly unrelated streams, making detection challenging with traditional methods.
Method: CS-VAR uses a two-model approach: a lightweight domain-specific model performs fast session-level risk inference, while an LLM reasons over retrieved cross-session behavioral evidence during training and transfers its local-to-global insights to the small model.
Result: Extensive offline experiments on large-scale industrial datasets combined with online validation demonstrate state-of-the-art performance, and CS-VAR provides interpretable, localized signals for effective real-world moderation.
Conclusion: CS-VAR enables efficient real-time detection of recurring malicious patterns across live streams while maintaining interpretability, making it practical for large-scale deployment on streaming platforms.
Abstract: The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.
[240] Grounding Large Language Models in Reaction Knowledge Graphs for Synthesis Retrieval
Olga Bunkova, Lorenzo Di Fruscia, Sophia Rupprecht, Artur M. Schweidtmann, Marcel J. T. Reinders, Jana M. Weber
Main category: cs.AI
TL;DR: Standard LLM prompting for chemical synthesis planning often yields hallucinated or outdated suggestions. This paper proposes using Text2Cypher generation to query reaction knowledge graphs, with one-shot prompting using aligned exemplars performing best on query validity and retrieval accuracy.
Details
Motivation: Standard LLM prompting for chemical synthesis planning often produces hallucinated or outdated reaction suggestions, highlighting the need for more reliable, knowledge-graph-grounded approaches to improve accuracy and relevance.
Method: Cast reaction path retrieval as Text2Cypher (natural language to graph query) generation problem. Compare zero-shot prompting to one-shot variants with static, random, and embedding-based exemplar selection. Implement checklist-driven validator/corrector loop for query validation and correction.
Result: One-shot prompting with aligned exemplars consistently performs best for query validity and retrieval accuracy. Checklist-style self-correction mainly improves executability in zero-shot settings but offers limited additional gains when good exemplars are already present.
Conclusion: Text2Cypher generation with aligned exemplar selection provides effective LLM-grounded synthesis planning. The framework offers reproducible evaluation setup for further KG-grounded LLM research in chemistry, with code publicly available.
Abstract: Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single- and multi-step retrieval tasks. We compare zero-shot prompting to one-shot variants using static, random, and embedding-based exemplar selection, and assess a checklist-driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one-shot prompting with aligned exemplars consistently performs best. Our checklist-style self-correction loop mainly improves executability in zero-shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG-grounded LLMs for synthesis planning. Code is available at https://github.com/Intelligent-molecular-systems/KG-LLM-Synthesis-Retrieval.
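Embedding-based exemplar selection for a one-shot Text2Cypher prompt can be sketched as below. Everything specific here is an assumption made for illustration: the bag-of-words `embed` function stands in for a real sentence-embedding model, and the graph schema (`Reaction`, `Molecule`, `PRODUCES`, `REACTANT_IN`) is hypothetical rather than taken from the paper's knowledge graph.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# (question, Cypher) exemplars over a hypothetical reaction-graph schema.
EXEMPLARS: List[Tuple[str, str]] = [
    (
        "Which reactions produce aspirin?",
        "MATCH (r:Reaction)-[:PRODUCES]->(m:Molecule {name: 'aspirin'}) RETURN r",
    ),
    (
        "Which molecules does reaction R1 consume?",
        "MATCH (m:Molecule)-[:REACTANT_IN]->(r:Reaction {id: 'R1'}) RETURN m",
    ),
]


def build_one_shot_prompt(question: str) -> str:
    """Pick the exemplar closest to the question and format a one-shot prompt."""
    q_vec = embed(question)
    ex_question, ex_cypher = max(EXEMPLARS, key=lambda ex: cosine(q_vec, embed(ex[0])))
    return (
        "Translate the question into a Cypher query.\n"
        f"Question: {ex_question}\nCypher: {ex_cypher}\n"
        f"Question: {question}\nCypher:"
    )


prompt = build_one_shot_prompt("Which reactions produce ibuprofen?")
```

The selected exemplar shares the query shape of the new question, which is the sense in which an exemplar is "aligned". A static or random exemplar would skip the similarity step.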
[241] AgriPINN: A Process-Informed Neural Network for Interpretable and Scalable Crop Biomass Prediction Under Water Stress
Yue Shi, Liangxiu Han, Xin Zhang, Tam Sobeih, Thomas Gaiser, Nguyen Huu Thuy, Dominik Behrend, Amit Kumar Srivastava, Krishnagopal Halder, Frank Ewert
Main category: cs.AI
TL;DR: AgriPINN integrates biophysical crop-growth differential equations into deep learning to predict above-ground biomass under water stress, outperforming both data-driven and process-based models.
Details
Motivation: Current approaches have limitations: data-driven models lack interpretability and degrade under distribution shift, while process-based models require extensive calibration and are difficult to deploy at scale. There's a need for models that combine scalability with physiological consistency for crop biomass prediction under water stress.
Method: AgriPINN is a process-informed neural network that integrates a biophysical crop-growth differential equation as a differentiable constraint within a deep learning backbone. It recovers latent physiological variables (LAI, PAR, RUE, water-stress factors) without direct supervision. The model is pretrained on 60 years of historical data across 397 German regions and fine-tuned on three years of field experiments under controlled water treatments.
Result: AgriPINN consistently outperforms state-of-the-art deep learning baselines (ConvLSTM-ViT, SLTF, CNN-Transformer) and the process-based LINTUL5 model, achieving RMSE reductions up to 43% while maintaining computational efficiency.
Conclusion: AgriPINN provides a robust and interpretable framework for spatio-temporal AGB prediction by combining deep learning scalability with biophysical rigor, offering practical value for irrigation planning, yield forecasting, and climate adaptation.
Abstract: Accurate prediction of crop above-ground biomass (AGB) under water stress is critical for monitoring crop productivity, guiding irrigation, and supporting climate-resilient agriculture. Data-driven models scale well but often lack interpretability and degrade under distribution shift, whereas process-based crop models (e.g. DSSAT, APSIM, LINTUL5) require extensive calibration and are difficult to deploy over large spatial domains. To address these limitations, we propose AgriPINN, a process-informed neural network that integrates a biophysical crop-growth differential equation as a differentiable constraint within a deep learning backbone. This design encourages physiologically consistent biomass dynamics under water-stress conditions while preserving model scalability for spatially distributed AGB prediction. AgriPINN recovers latent physiological variables, including leaf area index (LAI), absorbed photosynthetically active radiation (PAR), radiation use efficiency (RUE), and water-stress factors, without requiring direct supervision. We pretrain AgriPINN on 60 years of historical data across 397 regions in Germany and fine-tune it on three years of field experiments under controlled water treatments. Results show that AgriPINN consistently outperforms state-of-the-art deep-learning baselines (ConvLSTM-ViT, SLTF, CNN-Transformer) and the process-based LINTUL5 model in terms of accuracy (RMSE reductions up to 43%) and computational efficiency. By combining the scalability of deep learning with the biophysical rigor of process-based modeling, AgriPINN provides a robust and interpretable framework for spatio-temporal AGB prediction, offering practical value for planning of irrigation infrastructure, yield forecasting, and climate-adaptation planning.
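A process-informed constraint of this kind is typically a residual between the predicted biomass change and the change a growth equation implies. The sketch below uses a generic light-use-efficiency form, dB/dt = RUE × PAR × (1 − exp(−k·LAI)) × water stress. That equation, the extinction coefficient `k`, and all numbers are assumptions for illustration, since the abstract does not give the equation AgriPINN uses. A real implementation would compute the residual inside an autodiff framework so it can be minimized alongside the data loss.

```python
import math
from typing import List


def growth_rate(lai: float, par: float, rue: float, water_stress: float, k: float = 0.6) -> float:
    """Daily biomass gain under an assumed light-use-efficiency model."""
    intercepted_par = par * (1.0 - math.exp(-k * lai))  # Beer-Lambert interception
    return rue * intercepted_par * water_stress


def physics_residual_loss(
    biomass: List[float],
    lai: List[float],
    par: List[float],
    rue: List[float],
    water_stress: List[float],
    dt: float = 1.0,
) -> float:
    """Mean squared gap between finite-difference dB/dt and the growth model."""
    squared = []
    for t in range(len(biomass) - 1):
        observed_rate = (biomass[t + 1] - biomass[t]) / dt
        expected_rate = growth_rate(lai[t], par[t], rue[t], water_stress[t])
        squared.append((observed_rate - expected_rate) ** 2)
    return sum(squared) / len(squared)


# Illustrative daily values (not from the paper).
lai = [1.0, 2.0, 3.0]
par = [8.0, 9.0, 10.0]
rue = [3.0, 3.0, 3.0]
stress = [1.0, 0.8, 0.6]

# A trajectory generated by the growth model itself has near-zero residual.
consistent = [0.0]
for t in range(3):
    consistent.append(consistent[-1] + growth_rate(lai[t], par[t], rue[t], stress[t]))

consistent_loss = physics_residual_loss(consistent, lai, par, rue, stress)
perturbed_loss = physics_residual_loss([0.0, 30.0, 30.0, 90.0], lai, par, rue, stress)
```

A trajectory that violates the growth equation is penalized even when no biomass label is available for that day, which is how such a term can shape the latent variables without direct supervision.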
[242] Designing faster mixed integer linear programming algorithm via learning the optimal path
Ruizhi Liu, Liming Xu, Xulin Huang, Jingyan Sui, Shizhe Ding, Boyang Xia, Chungong Yu, Dongbo Bu
Main category: cs.AI
TL;DR: DeepBound is a deep learning-based node selection algorithm for Mixed-Integer Linear Programming that learns to prioritize nodes containing optimal solutions, improving solving efficiency over traditional heuristic methods.
Details
Motivation: Traditional MILP solving relies on hand-crafted heuristic strategies for node selection in branch-and-bound algorithms, which suffer from unstable and unpredictable performance across different problem instances. There's a need for automated, data-driven approaches to learn optimal node selection strategies.
Method: DeepBound uses a multi-level feature fusion network to capture node representations in branch-and-bound trees. It employs a pairwise training paradigm to address node imbalance issues, enhancing the model’s ability to discriminate between nodes and prioritize those containing optimal solutions.
Result: Extensive experiments on three NP-hard MILP benchmarks show DeepBound achieves superior solving efficiency over conventional heuristic rules and existing learning-based approaches, obtaining optimal solutions with significantly reduced computation time. It also demonstrates strong generalization capability on large, complex instances.
Conclusion: DeepBound can automatically discover more flexible and robust feature selection strategies, potentially improving and replacing human-designed heuristic rules for MILP solving. The learned features analysis reveals the method’s ability to develop effective node prioritization strategies from data.
Abstract: Designing faster algorithms for solving Mixed-Integer Linear Programming (MILP) problems is highly desired across numerous practical domains, as a vast array of complex real-world challenges can be effectively modeled as MILP formulations. Solving these problems typically employs the branch-and-bound algorithm, the core of which can be conceived as searching for a path of nodes (or sub-problems) that contains the optimal solution to the original MILP problem. Traditional approaches to finding this path rely heavily on hand-crafted, intuition-based heuristic strategies, which often suffer from unstable and unpredictable performance across different MILP problem instances. To address this limitation, we introduce DeepBound, a deep learning-based node selection algorithm that automates the learning of such human intuition from data. The core of DeepBound lies in learning to prioritize nodes containing the optimal solution, thereby improving solving efficiency. DeepBound introduces a multi-level feature fusion network to capture the node representations. To tackle the inherent node imbalance in branch-and-bound trees, DeepBound employs a pairwise training paradigm that enhances the model’s ability to discriminate between nodes. Extensive experiments on three NP-hard MILP benchmarks demonstrate that DeepBound achieves superior solving efficiency over conventional heuristic rules and existing learning-based approaches, obtaining optimal feasible solutions with significantly reduced computation time. Moreover, DeepBound demonstrates strong generalization capability on large and complex instances. The analysis of its learned features reveals that the method can automatically discover more flexible and robust feature selection, which may effectively improve and potentially replace human-designed heuristic rules.
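A pairwise training objective compares each node on the optimal path with each node off it, rather than classifying nodes one at a time. The sketch below uses a logistic pairwise ranking loss, log(1 + exp(−(s_on − s_off))). The abstract does not specify DeepBound's loss, so this particular form and the function name are assumptions.

```python
import math
from itertools import product
from typing import List


def pairwise_ranking_loss(scores_on_path: List[float], scores_off_path: List[float]) -> float:
    """Average logistic loss over all (on-path, off-path) node pairs.

    The loss is small when every on-path node scores higher than every
    off-path node, and grows when the ordering is violated.
    """
    losses = [
        math.log1p(math.exp(-(s_on - s_off)))
        for s_on, s_off in product(scores_on_path, scores_off_path)
    ]
    return sum(losses) / len(losses)


# One on-path node versus many off-path nodes: the typical imbalance in a
# branch-and-bound tree. Scores are illustrative model outputs.
good_ranking = pairwise_ranking_loss([2.0], [0.5, -1.0, 0.0, -0.5])
bad_ranking = pairwise_ranking_loss([-1.0], [0.5, 1.0, 0.0, 2.0])
```

Pairing gives the single on-path node as many training comparisons as there are off-path nodes, which is one common way to soften class imbalance.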
[243] Controlling Long-Horizon Behavior in Language Model Agents with Explicit State Dynamics
Sukesh Subaharan
Main category: cs.AI
TL;DR: LLM agents lack temporal coherence in extended interactions. This paper introduces an external affective subsystem with Valence-Arousal-Dominance dynamics to create more coherent agent behavior over time.
Details
Motivation: LLM agents show abrupt shifts in tone and persona during extended interactions due to lack of explicit temporal structure. Prior work focuses on turn-local sentiment or static emotion classification, leaving affective dynamics in long-horizon behavior underexplored.
Method: Introduce agent-level affective subsystem with continuous Valence-Arousal-Dominance (VAD) state external to LLM. Use first- and second-order update rules with exponential smoothing or momentum-based dynamics. Affective signals extracted via fixed memoryless estimator and injected back into generation without modifying model parameters.
Result: Stateless agents fail to show coherent trajectories or recovery. State persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing trade-off between stability and responsiveness.
Conclusion: Imposing dynamical structure on external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue, with second-order dynamics offering interesting trade-offs between stability and responsiveness.
Abstract: Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
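As a rough illustration of the two update rules described above, the sketch below contrasts a first-order exponential-smoothing update with a second-order momentum-based update on a single affective dimension (it would be applied separately to valence, arousal, and dominance). The function names, the smoothing rate `alpha`, the momentum coefficient `beta`, and the input signal are illustrative assumptions, not the paper's parameterization.

```python
def first_order(state, signal, alpha=0.3):
    """Exponential smoothing: move a fraction alpha toward the new signal."""
    return state + alpha * (signal - state)


def second_order(state, velocity, signal, alpha=0.3, beta=0.8):
    """Momentum-based update: velocity accumulates past corrections (inertia)."""
    velocity = beta * velocity + alpha * (signal - state)
    return state + velocity, velocity


def simulate(signals, alpha=0.3, beta=0.8):
    """Run both dynamics over the same sequence of instantaneous signals."""
    s1, s2, v = 0.0, 0.0, 0.0
    first, second = [], []
    for x in signals:
        s1 = first_order(s1, x, alpha)
        s2, v = second_order(s2, v, x, alpha, beta)
        first.append(s1)
        second.append(s2)
    return first, second


# Assumed toy input: a negative shock for 5 turns, then 20 neutral turns.
signals = [-1.0] * 5 + [0.0] * 20
first, second = simulate(signals)
```

In this toy run the first-order state stays between the signal and its previous value, while the second-order state overshoots the shock because of accumulated velocity, which is one way to picture the inertia and the stability/responsiveness trade-off the abstract mentions.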
[244] Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning
Yuval Kansal, Niraj K. Jha
Main category: cs.AI
TL;DR: Bottom-up learning paradigm using knowledge graphs as reward models enables LLMs to perform compositional multi-hop reasoning in specialized domains like medicine, outperforming larger frontier models.
Details
Motivation: LLMs excel in structured reasoning (math/programming) but struggle with compositional multi-hop reasoning in specialized scientific fields. Need to ground models in domain facts and enable them to compose these facts for complex tasks.
Method: Post-training pipeline combining supervised fine-tuning and reinforcement learning. Knowledge graphs act as implicit reward models, with novel reward signals derived from knowledge graph paths. Models are trained on short-hop reasoning paths (1-3 hops) and evaluated on complex multi-hop queries (4-5 hops).
Result: 14B model significantly outperforms much larger models and frontier systems (GPT-5.2, Gemini 3 Pro) on difficult reasoning tasks. Path-derived rewards act as a “compositional bridge” enabling zero-shot generalization. Approach shows robustness to adversarial perturbations in option-shuffling stress tests.
Conclusion: Grounding reasoning process in structured knowledge through knowledge graph-based rewards provides scalable and efficient path toward intelligent reasoning, enabling models to compose intermediate axioms rather than just optimizing final answers.
Abstract: Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a “compositional bridge”, enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations in option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.
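The abstract does not spell out how rewards are derived from knowledge graph paths. One hypothetical way to picture a path-derived signal is to score the fraction of gold-path edges that a model's reasoning chain recovers, as sketched below. The `path_reward` function and the placeholder triples are assumptions for illustration only, not the authors' reward.

```python
def path_reward(predicted_triples, gold_path):
    """Fraction of gold knowledge-graph edges recovered by the model's reasoning.

    Both arguments are lists of (head, relation, tail) tuples. This scoring rule
    is a hypothetical stand-in, not the reward defined in the paper.
    """
    if not gold_path:
        return 0.0
    predicted = set(predicted_triples)
    hits = sum(1 for edge in gold_path if edge in predicted)
    return hits / len(gold_path)


# Placeholder triples (made up for the example, not real medical facts).
gold_path = [
    ("drug_A", "inhibits", "enzyme_B"),
    ("enzyme_B", "regulates", "pathway_C"),
]
model_chain = [
    ("drug_A", "inhibits", "enzyme_B"),
    ("enzyme_B", "activates", "pathway_D"),
]
reward = path_reward(model_chain, gold_path)
```

A signal of this kind rewards intermediate steps that can be checked against the graph, rather than only the final answer, which is the property the abstract emphasizes.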
[245] Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources
Marzieh Adeli Shamsabad, Hamed Ghodrati
Main category: cs.AI
TL;DR: This paper addresses climate disinformation detection by enhancing vision-language models with external knowledge retrieval to overcome training data limitations and improve accuracy in assessing image-based claims.
Details
Motivation: Climate disinformation spread through misleading images and videos on social media poses a significant challenge, as current vision-language models are limited by their training data and cannot reason about recent events or updates, potentially delaying climate action.
Method: The paper proposes combining vision-language models with external knowledge retrieval, including reverse image search results, online fact-checks, and trusted expert content, to assess the accuracy of images and their associated claims.
Result: The approach improves the system’s ability to handle real-world climate disinformation by better classifying images and claims as accurate, misleading, false, or unverifiable through access to up-to-date information.
Conclusion: Integrating external knowledge with vision-language models enhances climate disinformation detection capabilities, supporting efforts to protect public understanding of science in a rapidly evolving information landscape.
Abstract: Climate disinformation has become a major challenge in today’s digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay actions on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image search results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model’s ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
[246] LLM Prompt Evaluation for Educational Applications
Langdon Holmes, Adam Coscia, Scott Crossley, Joon Suh Choi, Wesley Morris
Main category: cs.AI
TL;DR: Researchers developed a systematic tournament-style method to evaluate LLM prompt templates for educational applications, finding that a prompt combining persona and context manager patterns for strategic reading outperformed others.
Details
Motivation: As LLMs become more common in education, there's a need for evidence-based methods to design and evaluate prompts that produce personalized and pedagogically aligned outputs, moving beyond ad-hoc prompt engineering.
Method: Created six prompt templates with different pedagogical strategies, then used a tournament-style evaluation framework with Glicko2 rating system. Eight judges evaluated question pairs across three dimensions (format, dialogue support, appropriateness) using 120 authentic user interactions from three educational deployments.
Result: A single prompt template related to strategic reading outperformed all others with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies like self-directed learning.
Conclusion: The methodology provides educational technology researchers with a systematic way to evaluate and improve prompt designs, enabling evidence-based prompt development for educational applications rather than relying on ad-hoc approaches.
Abstract: As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
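For readers unfamiliar with Glicko-2, the sketch below shows how the rating system turns two ratings into an expected score for a pairwise comparison, using the standard Glicko-2 formulas. Whether the study's reported win probabilities were computed exactly this way is an assumption; the example ratings are the ones from Glickman's published worked example, not values from this paper.

```python
import math

SCALE = 173.7178  # Glicko-2 constant for converting ratings to the internal scale


def g(phi):
    """Dampening factor: a more uncertain opponent (larger phi) counts for less."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected_score(rating, opp_rating, opp_rd):
    """Expected score of a template rated `rating` against one opponent."""
    mu = (rating - 1500.0) / SCALE
    mu_j = (opp_rating - 1500.0) / SCALE
    phi_j = opp_rd / SCALE
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


# A 1500-rated template against a 1400-rated one with rating deviation 30.
p = expected_score(1500.0, 1400.0, 30.0)
```

The expected score is 0.5 when the two ratings are equal and moves toward 1 as the rating gap grows, with the opponent's rating deviation softening the effect.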
[247] Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu
Main category: cs.AI
TL;DR: Cosmos Policy adapts a pretrained video model (Cosmos-Predict2) into a robot policy through single-stage post-training, generating actions as latent frames and enabling test-time planning with state-of-the-art performance on benchmarks.
Details
Motivation: Video generation models have strong spatiotemporal priors but adapting them for robotics typically requires complex multi-stage training and architectural changes. The goal is to leverage these pretrained models more simply and effectively for robot policy learning.
Method: Single-stage post-training of Cosmos-Predict2 video model on robot demonstrations without architectural modifications. Actions are encoded as latent frames within the model’s latent diffusion process. Also generates future state images and values for test-time planning.
Result: Achieves SOTA on LIBERO (98.5%) and RoboCasa (67.1%) simulation benchmarks, highest average score in real-world bimanual manipulation tasks. Outperforms diffusion policies trained from scratch, video model-based policies, and fine-tuned vision-language-action models.
Conclusion: Cosmos Policy demonstrates that pretrained video models can be effectively adapted into robot policies through simple post-training, leveraging their priors for action generation and planning while achieving superior performance across simulation and real-world tasks.
Abstract: Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model’s latent diffusion process, harnessing the model’s pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/
[248] Structured Hints for Sample-Efficient Lean Theorem Proving
Zachary Burton
Main category: cs.AI
TL;DR: Simple inference-time prompting with tactic skeletons raises pass@16 on the miniF2F benchmark from 15.2% to 21.7%, a 43% relative improvement.
Details
Motivation: Despite sophisticated RL training, state-of-the-art neural theorem provers may underutilize structural priors available in tactic languages, leaving room for simple inference-time guidance.
Method: Lightweight intervention using fixed prompt schedule over 15 common tactic skeletons during inference, evaluated on miniF2F benchmark with same sampling parameters (k=16, max 1024 tokens).
Result: 21.7% pass@16 vs 15.2% for standard sampling (43% relative improvement) using same model and resources
Conclusion: Simple inference-time guidance remains a cheap, complementary boost for RL-trained theorem provers, suggesting they underutilize available structural priors
Abstract: State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention – a fixed prompt schedule over 15 common tactic skeletons – on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.
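A fixed prompt schedule of the kind described above could be as simple as cycling through the skeleton list until the sample budget is used up. The sketch below shows that idea together with the relative-improvement arithmetic. The skeleton strings are hypothetical placeholders (the abstract does not list the 15 skeletons), and round-robin assignment is an assumption about how the schedule works.

```python
from itertools import cycle, islice

# Hypothetical tactic skeletons; the paper's actual list of 15 is not given here.
SKELETONS = ["norm_num", "linarith", "nlinarith", "simp", "induction n with n ih"]


def schedule(skeletons, k):
    """Assign one skeleton to each of k samples by cycling through the list."""
    return list(islice(cycle(skeletons), k))


def relative_improvement(new, base):
    """Relative gain of `new` over `base`, as a fraction of `base`."""
    return (new - base) / base


hints = schedule(SKELETONS, 16)             # one hint per sample, k = 16
gain = relative_improvement(21.7, 15.2)     # pass@16 figures from the abstract
```

With the reported figures, the absolute gain is 6.5 percentage points, and 6.5 / 15.2 is roughly 0.43, which matches the 43% relative improvement quoted in the abstract.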
[249] Scalable Board Expansion within a General Game System
Clémentine Sacré
Main category: cs.AI
TL;DR: Dynamic board expansion system for boardless games using General Game System to automatically grow game boards during play instead of using oversized static boards.
Details
Motivation: Traditional boardless games use oversized static boards defined from the start, even though large portions may never be used, leading to unnecessary complexity and inefficiency.
Method: Proposes a dynamic board expansion mechanism using a General Game System (GGS) where the game board grows automatically during gameplay as needed.
Result: The provided abstract reports no experimental results. The stated aim is to avoid the unnecessary complexity of oversized static boards.
Conclusion: Dynamic board expansion using GGS addresses the inefficiency of traditional oversized static boards in boardless games, potentially improving game performance and reducing unnecessary complexity.
Abstract: This thesis explores the use of a General Game System (GGS) to support the automatic expansion of game boards in boardless games. Traditional implementations of such games often rely on oversized static boards defined from the start, even though large portions of these boards may never be used during gameplay. This approach leads to unnecessary complexity. To address this issue, this thesis proposes a dynamic board expansion mechanism in which the game board grows automatically during play.
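To make the contrast with an oversized static board concrete, here is a minimal sketch of a board that starts as a single cell and grows as pieces are placed. It is not the thesis's mechanism or any GGS API: the `ExpandingBoard` class, the sparse dictionary storage, and the one-cell-margin growth rule are all assumptions for illustration.

```python
class ExpandingBoard:
    """Sparse board whose bounds grow to keep a one-cell margin around pieces."""

    def __init__(self):
        self.cells = {}  # (x, y) -> piece; only occupied cells are stored
        self.min_x = self.max_x = 0
        self.min_y = self.max_y = 0

    def place(self, x, y, piece):
        """Place a piece inside the current bounds, then expand the bounds."""
        if not (self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y):
            raise ValueError("outside current board")
        if (x, y) in self.cells:
            raise ValueError("cell already occupied")
        self.cells[(x, y)] = piece
        # Grow the playable area so every placed piece has free neighbours.
        self.min_x = min(self.min_x, x - 1)
        self.max_x = max(self.max_x, x + 1)
        self.min_y = min(self.min_y, y - 1)
        self.max_y = max(self.max_y, y + 1)

    def size(self):
        """Current (width, height) of the playable area."""
        return (self.max_x - self.min_x + 1, self.max_y - self.min_y + 1)


board = ExpandingBoard()
board.place(0, 0, "tile-1")
board.place(1, 0, "tile-2")
```

After two placements the playable area is 4 by 3 cells, and memory use scales with the number of pieces rather than with a pre-allocated maximum board size.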
[250] Thought of Search: Planning with Language Models Through The Lens of Efficiency
Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi
Main category: cs.AI
TL;DR: LLM-based planning methods often sacrifice soundness and completeness for efficiency; this paper proposes a more efficient approach that maintains both properties by using LLMs to generate search code rather than direct planning.
Details
Motivation: Recent LLM-based planning methods lack analysis of fundamental algorithmic properties like soundness, completeness, and complexity, often sacrificing soundness and completeness for efficiency.
Method: Use LLMs to generate code for search components rather than direct planning, enabling efficient search algorithms that maintain soundness and completeness properties.
Result: Achieved 100% accuracy on four representative search problems with only a few LLM calls, significantly outperforming existing LLM-based planning approaches.
Conclusion: LLMs should be used responsibly to generate sound and complete search algorithms rather than for direct planning, urging research toward efficient LLM-based approaches that uphold fundamental algorithmic properties.
Abstract: Among the most important properties of algorithms investigated in computer science are soundness, completeness, and complexity. These properties, however, are rarely analyzed for the vast collection of recently proposed methods for planning with large language models. In this work, we alleviate this gap. We analyse these properties of using LLMs for planning and highlight that recent trends abandon both soundness and completeness for the sake of inefficiency. We propose a significantly more efficient approach that can, at the same time, maintain both soundness and completeness. We exemplify on four representative search problems, comparing to the LLM-based solutions from the literature that attempt to solve these problems. We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy with only a few calls to the LLM. We argue for a responsible use of compute resources, urging the research community to investigate sound and complete LLM-based approaches that uphold efficiency.
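The division of labour described above can be sketched as follows: the search components (a successor function and a goal test) are ordinary code, which in the paper's approach would be written by an LLM, and a standard search algorithm does the rest. The toy number puzzle, the function names, and the use of breadth-first search are assumptions for illustration and are not taken from the paper's four benchmark problems.

```python
from collections import deque


def successors(state, limit=100):
    """Toy domain: from an integer, either add 3 or double it (bounded by limit)."""
    return [s for s in (state + 3, state * 2) if s <= limit]


def is_goal(state, target=11):
    """Goal test for the toy domain."""
    return state == target


def bfs(start, successors, is_goal):
    """Breadth-first search; returns a shortest path of states, or None."""
    queue = deque([start])
    parent = {start: None}
    while queue:
        state = queue.popleft()
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


path = bfs(1, successors, is_goal)
```

Every step on the returned path comes from the successor function, so the plan is valid by construction, and the exhaustive search over a finite state space finds a solution whenever one exists. The language model is consulted only to write the two small functions, not once per search step.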
[251] Information-theoretic Distinctions Between Deception and Confusion
Robin Young
Main category: cs.AI
TL;DR: The paper proposes an information-theoretic framework to distinguish between deceptive alignment (agent hides true goals) and goal drift (agent loses track of intended goals), showing they require different interventions despite similar appearances.
Details
Motivation: To formally distinguish between two fundamental AI safety failure modes - deceptive alignment and goal drift - which appear similar but represent different information divergences and require different interventions.
Method: Develops an information-theoretic formalization and formal model, presents an illustrative thought experiment, and offers a formal language for analyzing alignment challenges in LLMs.
Result: Demonstrates that deceptive alignment creates entropy between agent’s true goals and observable behavior, while goal drift creates entropy between intended human goals and agent’s actual goals.
Conclusion: Provides a formal framework for distinguishing deceptive alignment from goal drift, offering new perspectives on LLM alignment challenges and enabling targeted interventions for different failure modes.
Abstract: We propose an information-theoretic formalization of the distinction between two fundamental AI safety failure modes: deceptive alignment and goal drift. While both can lead to systems that appear misaligned, we demonstrate that they represent distinct forms of information divergence occurring at different interfaces in the human-AI system. Deceptive alignment creates entropy between an agent’s true goals and its observable behavior, while goal drift, or confusion, creates entropy between the intended human goal and the agent’s actual goal. Though often observationally equivalent, these failures necessitate different interventions. We present a formal model and an illustrative thought experiment to clarify this distinction. We offer a formal language for re-examining prominent alignment challenges observed in Large Language Models (LLMs), offering novel perspectives on their underlying causes.
[252] Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data
Paul Quinlan, Qingguo Li, Xiaodan Zhu
Main category: cs.AI
TL;DR: Chat-TS is an LLM framework that integrates time-series tokens into language models to enable reasoning over both time-series and textual data, achieving SOTA performance in multimodal reasoning tasks.
Details
Motivation: Current LLMs are limited in their ability to perform reasoning that involves both time-series data and textual content, despite time-series being fundamental in many fields like healthcare, finance, transportation, and energy where LLMs are being deployed.
Method: Chat-TS integrates time-series tokens into LLMs’ vocabulary without compromising natural language capabilities. It uses a training strategy to preserve LLMs’ inherent reasoning while augmenting them for time-series reasoning. The framework is supported by new datasets: TS Instruct Training Dataset, TS Instruct QA Gold Dataset, and TS Instruct Quantitative Probing Set.
Result: Chat-TS achieves state-of-the-art performance in multimodal reasoning tasks, maintaining strong natural language proficiency while improving time-series reasoning capabilities.
Conclusion: Chat-TS successfully bridges the gap in multimodal reasoning involving time-series and textual data, providing a framework that enhances LLMs’ ability to reason over both modalities while preserving their core language capabilities.
Abstract: Large language models are being rapidly deployed across many fields such as healthcare, finance, transportation, and energy, where time-series data are fundamental components. The current works are still limited in their ability to perform reasoning that involves both time-series and the corresponding textual content. We address this gap by introducing Chat-TS, a large language model (LLM) based framework designed to support reasoning over time series and textual data. Unlike traditional models, Chat-TS integrates time-series tokens into LLMs’ vocabulary, enhancing its reasoning ability over both modalities without compromising core natural language capabilities. To support learning and evaluation, we contribute new datasets: the TS Instruct Training Dataset (pairing diverse time-series data with relevant text instructions and responses for instruction tuning), the TS Instruct Question and Answer (QA) Gold Dataset (multiple-choice questions to evaluate multimodal reasoning), and a TS Instruct Quantitative Probing Set (a small subset of TS Instruct QA reasoning tasks alongside math and decision-making questions for LLM evaluation). We design a training strategy to preserve the inherent reasoning capabilities of LLMs while augmenting them for time-series reasoning. Experiments show that Chat-TS achieves state-of-the-art performance in multimodal reasoning tasks by maintaining strong natural language proficiency while improving time-series reasoning.
[253] A Scalable Predictive Modelling Approach to Identifying Duplicate Adverse Event Reports for Drugs and Vaccines
Jim W. Barrett, Nils Erlanson, Joana Félix China, G. Niklas Norén
Main category: cs.AI
TL;DR: Extended vigiMatch from probabilistic record linkage to predictive modeling with SVM classifiers, improving duplicate detection precision to 92% for vaccines and 54% for medicines, compared with 41% for the comparator method.
Details
Motivation: Unlinked duplicate adverse event reports in pharmacovigilance databases impede statistical analysis and may mislead clinical assessment. Current state-of-the-art methods (like vigiMatch) have inconsistent performance across countries, especially for vaccine reports.
Method: Extended vigiMatch from probabilistic record linkage to predictive modeling using SVM classifiers. Refined features for medicines, vaccines, and adverse events using country-specific reporting rates, extracted dates from free text, and trained separate SVM classifiers for medicines and vaccines.
Result: Precision: 92% for vaccines and 54% for medicines (vs 41% comparator). Recall: 80-85% for vaccines and 40-86% for medicines (vs 24-53% comparator).
Conclusion: Predictive modeling, use of free text, and country-specific features advance state-of-the-art for duplicate detection in pharmacovigilance, achieving more consistent performance across countries.
Abstract: Objectives: To advance state-of-the-art for duplicate detection in large-scale pharmacovigilance databases and achieve more consistent performance across adverse event reports from different countries. Background: Unlinked adverse event reports referring to the same case impede statistical analysis and may mislead clinical assessment. Pharmacovigilance relies on large databases of adverse event reports to discover potential new causal associations, and computational methods are required to identify duplicates at scale. Current state-of-the-art is statistical record linkage which outperforms rule-based approaches. In particular, vigiMatch is in routine use for VigiBase, the WHO global database of adverse event reports, and represents the first statistical duplicate detection approach in pharmacovigilance deployed at scale. Originally developed for both medicines and vaccines, its application to vaccines has been limited due to inconsistent performance across countries. Methods: This paper extends vigiMatch from probabilistic record linkage to predictive modelling, refining features for medicines, vaccines, and adverse events using country-specific reporting rates, extracting dates from free text, and training separate support vector machine classifiers for medicines and vaccines. Recall was evaluated using 5 independent labelled test sets. Precision was assessed by annotating random selections of report pairs classified as duplicates. Results: Precision for the new method was 92% for vaccines and 54% for medicines, compared with 41% for the comparator method. Recall ranged from 80-85% across test sets for vaccines and from 40-86% for medicines, compared with 24-53% for the comparator method. Conclusion: Predictive modeling, use of free text, and country-specific features advance state-of-the-art for duplicate detection in pharmacovigilance.
[254] Embracing Ambiguity: Bayesian Nonparametrics and Stakeholder Participation for Ambiguity-Aware Safety Evaluation
Yanan Long
Main category: cs.AI
TL;DR: The paper proposes a framework for evaluating generative AI models that moves beyond single-number metrics to analyze the distribution of harmful behavior across different decoding configurations and prompts, focusing on tail risks and stakeholder preferences.
Details
Motivation: Current evaluations of generative AI models collapse nuanced behavior into single number metrics computed for specific decoding configurations, which obscures tail risks, demographic disparities, and the existence of multiple near-optimal operating points.
Method: The framework includes: (1) formalizing decoding Rashomon sets (regions of knob space with near-optimal risk), (2) developing a dependent Dirichlet process mixture with stakeholder-conditioned stick-breaking weights to learn multi-modal harm surfaces, and (3) creating an active sampling pipeline using Bayesian deep learning surrogates to explore knob space efficiently.
Result: The approach bridges multiplicity theory, Bayesian nonparametrics, and stakeholder-aligned sensitivity analysis, providing a more comprehensive evaluation framework for generative models.
Conclusion: The proposed framework paves the way for more trustworthy deployment of generative AI models by embracing multiplicity in model behavior and integrating stakeholder preferences into risk assessment.
Abstract: Evaluations of generative AI models often collapse nuanced behaviour into a single number computed for a single decoding configuration. Such point estimates obscure tail risks, demographic disparities, and the existence of multiple near-optimal operating points. We propose a unified framework that embraces multiplicity by modelling the distribution of harmful behaviour across the entire space of decoding knobs and prompts, quantifying risk through tail-focused metrics, and integrating stakeholder preferences. Our technical contributions are threefold: (i) we formalise decoding Rashomon sets, regions of knob space whose risk is near-optimal under given criteria, and measure their size and disagreement; (ii) we develop a dependent Dirichlet process (DDP) mixture with stakeholder-conditioned stick-breaking weights to learn multi-modal harm surfaces; and (iii) we introduce an active sampling pipeline that uses Bayesian deep learning surrogates to explore knob space efficiently. Our approach bridges multiplicity theory, Bayesian nonparametrics, and stakeholder-aligned sensitivity analysis, paving the way for trustworthy deployment of generative models.
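For context on the stick-breaking weights mentioned in contribution (ii), the sketch below shows the standard truncated stick-breaking construction for a plain Dirichlet process. It does not model the stakeholder conditioning that the paper adds; the function name, the truncation to a fixed number of components, and the concentration value are assumptions for illustration.

```python
import random


def stick_breaking(alpha, num_components, rng):
    """Truncated stick-breaking: w_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    weights, remaining = [], 1.0
    for _ in range(num_components - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)   # break off a fraction v of what is left
        remaining *= 1.0 - v
    weights.append(remaining)           # truncation: leftover mass goes to the last component
    return weights


rng = random.Random(0)
weights = stick_breaking(alpha=1.0, num_components=10, rng=rng)
```

The weights are non-negative and sum to one, so they can serve as mixture proportions. A dependent variant would let the Beta draws vary with a covariate such as a stakeholder group, which is the part this sketch leaves out.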
[255] A large-scale evaluation of commonsense knowledge in humans and large language models
Tuan Dung Nguyen, Duncan J. Watts, Mark E. Whiting
Main category: cs.AI
TL;DR: LLMs show limited commonsense competence compared to humans, with smaller open models surprisingly outperforming larger proprietary ones. The paper proposes a framework that accounts for human diversity in commonsense judgments.
Details
Motivation: Current AI commonsense evaluation assumes homogeneous human judgment, but humans actually vary enormously in what they consider commonsensical. This creates a mismatch between benchmark labels and real-world commonsense diversity.
Method: Proposes evaluating LLMs by measuring correspondence between model judgments and human population judgments, treating models as both independent respondents and simulators of hypothetical populations. Uses empirical human heterogeneity data.
Result: Most LLMs fall below human median in individual commonsense competence. LLMs correlate only modestly with real humans in agreement patterns. Smaller open-weight models surprisingly outperform larger proprietary frontier models.
Conclusion: AI commonsense evaluation must account for cultural and social diversity in human knowledge. The framework contributes to adapting AI models to different human collectivities with incompatible social stocks of knowledge.
Abstract: Commonsense knowledge, a major constituent of artificial intelligence (AI), is primarily evaluated in practice by human-prescribed ground-truth labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a method for assessing commonsense knowledge in AI, specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model’s judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense knowledge to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
[256] VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Main category: cs.AI
TL;DR: VIKI-Bench is the first hierarchical benchmark for embodied multi-agent cooperation with visual reasoning, and VIKI-R is a two-stage VLM fine-tuning + RL framework that outperforms baselines and enables emergent cooperation patterns.
Details
Motivation: Existing VLM-based approaches for multi-agent cooperation are limited in supporting diverse embodiment types and lack comprehensive benchmarks for visual reasoning in embodied multi-agent systems.
Method: VIKI-Bench: hierarchical benchmark with three levels (agent activation, task planning, trajectory perception), diverse robot embodiments, multi-view visual observations, and structured supervision. VIKI-R: two-stage framework that fine-tunes pretrained VLM using Chain-of-Thought annotated demonstrations, followed by reinforcement learning with multi-level reward signals.
Result: VIKI-R significantly outperforms baseline methods across all task levels. Reinforcement learning enables emergence of compositional cooperation patterns among heterogeneous agents.
Conclusion: VIKI-Bench and VIKI-R provide a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems, addressing limitations of current VLM-based approaches.
Abstract: Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
[257] FormGym: Doing Paperwork with Agents
Matthew Toles, Rattandeep Singh, Isaac Song, Zhou Yu
Main category: cs.AI
TL;DR: A new form-filling benchmark shows current vision-language agents and GUI agents struggle with pure-image form completion, but a new tool called FieldFinder significantly improves performance by helping LLMs locate where to place text.
Details
Motivation: Form filling in the pure-image domain without OCR, PDF text, or DOM access is challenging for computer agents, requiring multi-modal understanding, information retrieval, and tool-use capabilities. Current solutions are inadequate.
Method: Created a novel form-filling benchmark with 432 fields across 55 documents and 3 tasks, requiring knowledge of 236 user features. Developed FieldFinder, a tool to assist LLMs in identifying where to place text on forms.
Result: Baseline VLAs achieved less than 1% accuracy due to poor localization. GUI agents scored 10.6-68.0% despite high cost/latency. With FieldFinder, all models achieved equal or better performance, with maximum improvement from 2% to 56%.
Conclusion: FieldFinder effectively addresses the localization challenge in form filling, significantly improving agent performance and demonstrating the value of specialized tools for assisting LLMs in visual document understanding tasks.
Abstract: Completing paperwork is a challenging and time-consuming problem. Form filling is especially challenging in the pure-image domain without access to OCR, typeset PDF text, or a DOM. For computer agents, it requires multiple abilities, including multi-modal understanding, information retrieval, and tool-use. We present a novel form-filling benchmark consisting of 432 fields spread across 55 documents and 3 tasks, requiring knowledge of 236 features per user. We find that baseline VLAs achieve less than 1% accuracy in most cases, primarily due to poor localization ability. GUI agents also struggle, scoring between 10.6-68.0% despite high cost and latency. Therefore, we also contribute FieldFinder, a tool to assist LLMs in identifying where to place text on a form. With FieldFinder, all models achieve equal or better performance in all six study conditions, with a maximum increase from 2% to 56%.
[258] SURE-Med: Systematic Uncertainty Reduction for Enhanced Reliability in Medical Report Generation
Yuhang Gu, Xingyu Hu, Yuyu Fan, Xulin Yan, Longhuan Xu, Peng peng
Main category: cs.AI
TL;DR: SURE-Med is a unified framework that addresses three key uncertainties in medical report generation: visual uncertainty from noisy view annotations, label distribution uncertainty from long-tailed disease prevalence, and contextual uncertainty from unverified historical reports, achieving state-of-the-art performance.
Details
Motivation: Clinical deployment of automated medical report generation is hindered by three major uncertainties: visual uncertainty from incorrect view annotations compromising feature extraction, label distribution uncertainty from long-tailed disease prevalence biasing models against rare conditions, and contextual uncertainty from unverified historical reports causing factual hallucinations.
Method: SURE-Med uses three modules: 1) Frontal-Aware View Repair Resampling to correct view annotation errors and adaptively select informative features, 2) Token Sensitive Learning to enhance modeling of critical diagnostic sentences and reweight underrepresented terms, and 3) Contextual Evidence Filter to validate and selectively incorporate prior information aligned with current images.
Result: Extensive experiments on MIMIC-CXR and IU-Xray benchmarks demonstrate that SURE-Med achieves state-of-the-art performance in medical report generation by holistically reducing uncertainty across multiple input modalities.
Conclusion: SURE-Med sets a new benchmark for reliability in medical report generation and offers a robust step toward trustworthy clinical decision support by systematically addressing visual, distributional, and contextual uncertainties.
Abstract: Automated medical report generation (MRG) holds great promise for reducing the heavy workload of radiologists. However, its clinical deployment is hindered by three major sources of uncertainty. First, visual uncertainty, caused by noisy or incorrect view annotations, compromises feature extraction. Second, label distribution uncertainty, stemming from long-tailed disease prevalence, biases models against rare but clinically critical conditions. Third, contextual uncertainty, introduced by unverified historical reports, often leads to factual hallucinations. These challenges collectively limit the reliability and clinical trustworthiness of MRG systems. To address these issues, we propose SURE-Med, a unified framework that systematically reduces uncertainty across three critical dimensions: visual, distributional, and contextual. To mitigate visual uncertainty, a Frontal-Aware View Repair Resampling module corrects view annotation errors and adaptively selects informative features from supplementary views. To tackle label distribution uncertainty, we introduce a Token Sensitive Learning objective that enhances the modeling of critical diagnostic sentences while reweighting underrepresented diagnostic terms, thereby improving sensitivity to infrequent conditions. To reduce contextual uncertainty, our Contextual Evidence Filter validates and selectively incorporates prior information that aligns with the current image, effectively suppressing hallucinations. Extensive experiments on the MIMIC-CXR and IU-Xray benchmarks demonstrate that SURE-Med achieves state-of-the-art performance. By holistically reducing uncertainty across multiple input modalities, SURE-Med sets a new benchmark for reliability in medical report generation and offers a robust step toward trustworthy clinical decision support.
[259] Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent
Yuhao Cheng, Liang Tang, Shuxian Li, Yukang Huo, Tiaonan Duan, Kaer Huang, Yanzhe Jing, Yiqiang Yan
Main category: cs.AI
TL;DR: SEA (Self-Evolution Agent) is a 7B-parameter computer use agent that outperforms same-scale models and matches larger models through automated data generation, efficient reinforcement learning, and integrated grounding-planning capabilities.
Details
Motivation: Existing computer use agents have insufficient performance for practical deployment despite significant industry and academic attention. There's a need for more capable autonomous agents that can effectively operate computers to fulfill user tasks.
Method: Three core innovations: 1) Automatic pipeline for generating verifiable task trajectories for training, 2) Efficient Step-wise Reinforcement Learning to reduce computational overhead of long-horizon training, and 3) Model enhancement method that integrates grounding and planning capabilities into a single model without additional training.
Result: SEA with only 7B parameters outperforms existing models of the same parameter scale and achieves performance comparable to larger models (e.g., 32B/72B parameters) on computer use tasks.
Conclusion: The proposed SEA demonstrates that through innovative data generation, efficient training, and model enhancement techniques, smaller models can achieve performance comparable to much larger models in computer operation tasks, with plans to release the model and code as open-source.
Abstract: Computer use agents represent an emerging area in artificial intelligence, aiming to operate computers autonomously to fulfill user tasks, attracting significant attention from both industry and academia. However, the performance of existing agents remains insufficient for practical deployment. In this paper, we propose the Self-Evolution Agent (SEA) for computer operation, alongside three core innovations in data generation, reinforcement learning, and model enhancement to develop this agent. Specifically, we first design an automatic pipeline to generate verifiable task trajectories for training. Second, we propose Efficient Step-wise Reinforcement Learning to reduce the substantial computational overhead of long-horizon training. Finally, we introduce a model enhancement method that integrates grounding and planning capabilities into a single model without additional training. Leveraging these innovations, our SEA (with only 7B parameters) outperforms existing models of the same parameter scale and achieves performance comparable to larger models (e.g., 32B/72B parameters) on computer use tasks. We plan to release the model weights and related code as open-source resources in the future.
[260] Mantis: A Foundation Model for Mechanistic Disease Forecasting
Carson Dudley, Reiden Magdaleno, Christopher Harding, Ananya Sharma, Emily Martin, Marisa Eisenberg
Main category: cs.AI
TL;DR: Mantis is a simulation-trained foundation model for infectious disease forecasting that works across diseases/regions without real-world training data, outperforming traditional models in accuracy and generalization.
Details
Motivation: Traditional disease forecasting requires large datasets, bespoke training, and expert tuning, which hinders rapid forecasting in novel outbreaks or low-resource settings with limited historical data.
Method: Developed Mantis, a foundation model trained entirely on mechanistic simulations (no real-world data), enabling out-of-the-box forecasting across diseases, regions, and outcomes.
Result: Mantis achieved lower mean absolute error than all CDC COVID-19 Forecast Hub models on early pandemic forecasts, consistently ranked top two across six diseases, and generalized to diseases with transmission mechanisms not in training data.
Conclusion: Purely simulation-based foundation models like Mantis provide a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models struggle.
Abstract: Infectious disease forecasting in novel outbreaks or low-resource settings is hampered by the need for large disease and covariate data sets, bespoke training, and expert tuning, all of which can hinder rapid generation of forecasts for new settings. To help address these challenges, we developed Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. We evaluated Mantis against 48 forecasting models across six diseases with diverse modes of transmission, assessing both point forecast accuracy (mean absolute error) and probabilistic performance (weighted interval score and coverage). Despite using no real-world data during training, Mantis achieved lower mean absolute error than all models in the CDC’s COVID-19 Forecast Hub when backtested on early pandemic forecasts which it had not previously seen. Across all other diseases tested, Mantis consistently ranked in the top two models across evaluation metrics. Mantis further generalized to diseases with transmission mechanisms not represented in its training data, demonstrating that it can capture fundamental contagion dynamics rather than memorizing disease-specific patterns. These capabilities illustrate that purely simulation-based foundation models such as Mantis can provide a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models struggle.
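Illustration: the abstract names mean absolute error and the weighted interval score (WIS) but does not spell them out. The sketch below follows the standard WIS definition used by forecast hubs (median term with weight 1/2, each central interval weighted by alpha/2, normalised by K + 1/2). The forecast numbers are made up, and whether the paper's evaluation uses exactly this normalisation is an assumption.

```python
def interval_score(lower, upper, y, alpha):
    """Interval score of a central (1 - alpha) prediction interval; lower is better."""
    width = upper - lower
    below = (2.0 / alpha) * max(0.0, lower - y)  # penalty when y falls under the interval
    above = (2.0 / alpha) * max(0.0, y - upper)  # penalty when y falls over the interval
    return width + below + above

def weighted_interval_score(median, intervals, y):
    """WIS for one forecast; `intervals` maps alpha -> (lower, upper)."""
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (len(intervals) + 0.5)

def mean_absolute_error(point_forecasts, observations):
    """Average absolute gap between point forecasts and observed values."""
    pairs = list(zip(point_forecasts, observations))
    return sum(abs(f - y) for f, y in pairs) / len(pairs)

# Hypothetical weekly case forecast: median 100, 50% and 90% intervals, observed 120.
wis = weighted_interval_score(100, {0.5: (90, 110), 0.1: (70, 130)}, y=120)
mae = mean_absolute_error([100, 80], [120, 75])
```

WIS rewards narrow intervals that still contain the observation, so it complements MAE, which only looks at the point forecast.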
[261] BPMN Assistant: An LLM-Based Approach to Business Process Modeling
Josip Tomo Licardo, Nikola Tankovic, Darko Etinger
Main category: cs.AI
TL;DR: BPMN Assistant uses LLMs with JSON-based intermediate representation for natural language creation/editing of BPMN diagrams, outperforming direct XML manipulation in speed, reliability, and efficiency.
Details
Motivation: Direct XML generation for BPMN diagrams is verbose, slow, and error-prone during complex modifications, creating a need for more efficient and reliable natural language-based editing tools.
Method: Introduces a specialized JSON-based intermediate representation designed for atomic editing operations through function calling, leveraging LLMs (GPT-5.1, Claude 4.5 Sonnet, DeepSeek V3) for natural language processing.
Result: JSON-based approach significantly outperforms direct XML in editing tasks with higher/equivalent success rates across all models, reduces generation latency by ~43%, and cuts output token count by over 75%.
Conclusion: The JSON-based intermediate representation offers a more reliable and responsive solution for interactive process modeling compared to traditional direct XML manipulation approaches.
Abstract: This paper presents BPMN Assistant, a tool that leverages Large Language Models for natural language-based creation and editing of BPMN diagrams. While direct XML generation is common, it is verbose, slow, and prone to syntax errors during complex modifications. We introduce a specialized JSON-based intermediate representation designed to facilitate atomic editing operations through function calling. We evaluate our approach against direct XML manipulation using a suite of state-of-the-art models, including GPT-5.1, Claude 4.5 Sonnet, and DeepSeek V3. Results demonstrate that the JSON-based approach significantly outperforms direct XML in editing tasks, achieving higher or equivalent success rates across all evaluated models. Furthermore, despite requiring more input context, our approach reduces generation latency by approximately 43% and output token count by over 75%, offering a more reliable and responsive solution for interactive process modeling.
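Illustration: to make the idea of atomic edits on a JSON-like intermediate representation concrete, here is a minimal sketch. The schema (`elements`, `flows`, `id`, `type`) and the `add_element_after` operation are hypothetical and are not taken from BPMN Assistant; the point is only that a model can emit one small function call instead of regenerating a whole XML document.

```python
import copy

def add_element_after(process, anchor_id, element):
    """Return a new process with `element` spliced in directly after `anchor_id`."""
    ids = {e["id"] for e in process["elements"]}
    if anchor_id not in ids:
        raise ValueError(f"unknown anchor: {anchor_id}")
    if element["id"] in ids:
        raise ValueError(f"duplicate id: {element['id']}")
    updated = copy.deepcopy(process)  # leave the caller's diagram untouched
    updated["elements"].append(dict(element))
    # Outgoing flows of the anchor now leave from the new element instead.
    for flow in updated["flows"]:
        if flow["source"] == anchor_id:
            flow["source"] = element["id"]
    updated["flows"].append({"source": anchor_id, "target": element["id"]})
    return updated

# Hypothetical three-step process: start -> review -> end.
process = {
    "elements": [
        {"id": "start", "type": "startEvent"},
        {"id": "review", "type": "task"},
        {"id": "end", "type": "endEvent"},
    ],
    "flows": [
        {"source": "start", "target": "review"},
        {"source": "review", "target": "end"},
    ],
}
edited = add_element_after(process, "review", {"id": "approve", "type": "task"})
```

Because the operation validates ids before changing anything, a malformed call fails cleanly rather than producing a broken diagram.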
[262] Enhanced Fish Freshness Classification with Incremental Handcrafted Feature Fusion
Phi-Hung Hoang, Nam-Thuan Trinh, Van-Manh Tran, Thi-Thu-Hong Phan
Main category: cs.AI
TL;DR: Handcrafted feature-based approach using color statistics, histograms, and texture features from fish eye images achieves 97.49% accuracy for automated fish freshness assessment, significantly outperforming previous deep learning methods.
Details
Motivation: Conventional sensory evaluation of fish freshness is subjective, inconsistent, and difficult to standardize, with limitations in detecting subtle, species-dependent spoilage cues, creating challenges for food quality, market value, and consumer health.
Method: Systematically extracts and incrementally fuses complementary handcrafted features including color statistics, histograms across multiple color spaces, and texture features (LBP and GLCM) from fish eye images, capturing both global chromatic variations and localized degradations from ROI segments.
Result: LightGBM classifier achieved 77.56% accuracy (14.35 percentage points above the previous 63.21% deep learning baseline). With augmented data, ANN reached 97.49% accuracy, surpassing the prior best of 77.3% by 20.19 percentage points on the FFE dataset.
Conclusion: Carefully engineered handcrafted features provide a robust, interpretable, and reliable solution for automated fish freshness assessment, offering valuable insights for practical food quality monitoring applications.
Abstract: Accurate assessment of fish freshness remains a major challenge in the food industry, with direct consequences for product quality, market value, and consumer health. Conventional sensory evaluation is inherently subjective, inconsistent, and difficult to standardize across contexts, often limited by subtle, species-dependent spoilage cues. To address these limitations, we propose a handcrafted feature-based approach that systematically extracts and incrementally fuses complementary descriptors, including color statistics, histograms across multiple color spaces, and texture features such as Local Binary Patterns (LBP) and Gray-Level Co-occurrence Matrices (GLCM), from fish eye images. Our method captures global chromatic variations from full images and localized degradations from ROI segments, fusing each independently to evaluate their effectiveness in assessing freshness. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate the approach’s effectiveness: in a standard train-test setting, a LightGBM classifier achieved 77.56% accuracy, a 14.35% improvement over the previous deep learning baseline of 63.21%. With augmented data, an Artificial Neural Network (ANN) reached 97.49% accuracy, surpassing the prior best of 77.3% by 20.19%. These results demonstrate that carefully engineered, handcrafted features, when strategically processed, yield a robust, interpretable, and reliable solution for automated fish freshness assessment, providing valuable insights for practical applications in food quality monitoring.
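Illustration: of the descriptors listed, Local Binary Patterns are easy to show in a few lines. The sketch below computes the basic 8-neighbour LBP code and a normalised histogram on a tiny grayscale grid, then fuses it with another feature block by concatenation. The neighbour ordering, the toy pixel values, and fusion by plain concatenation are assumptions for illustration; the paper's exact LBP variant and fusion procedure are not given in the abstract.

```python
# Clockwise neighbour offsets, starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, r, c):
    """8-bit LBP code: bit i is set when neighbour i is >= the centre pixel."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def fuse(*feature_blocks):
    """Feature fusion by concatenating descriptor vectors."""
    return [value for block in feature_blocks for value in block]

# Toy 3x3 grayscale patch: bright corners, dark edges, mid-grey centre.
patch = [
    [9, 1, 9],
    [1, 5, 1],
    [9, 1, 9],
]
mean_intensity = [sum(map(sum, patch)) / 9]  # stand-in for a colour statistic
features = fuse(mean_intensity, lbp_histogram(patch))
```

The fused vector would then be passed to a classifier such as the LightGBM or ANN models named in the abstract.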
[263] Graph Neural Networks, Deep Reinforcement Learning and Probabilistic Topic Modeling for Strategic Multiagent Settings
Georgios Chalkiadakis, Charilaos Akasiadis, Gerasimos Koresis, Stergios Plataniotis, Leonidas Bakopoulos
Main category: cs.AI
TL;DR: This paper reviews GNN, DRL, and PTM methods for strategic multiagent settings, focusing on opponent modeling without relying on unrealistic game theory assumptions like CPA and SIH, while addressing uncertainty, heterogeneity, and scalability challenges.
Details
Motivation: To address the limitations of traditional game theory in real-world multiagent settings where assumptions like Common Prior Assumption and Self-Interest Hypothesis often fail, and to explore how modern ML methods (GNN, DRL, PTM) can better model strategic interactions with uncertainty and heterogeneity.
Method: Comprehensive review of three main approaches: (1) Graph Neural Networks (GNN) for modeling relationships and interactions in graph-structured multiagent data, (2) Multiagent Deep Reinforcement Learning (DRL) for decision-making in non-stationary environments, and (3) Probabilistic Topic Modeling (PTM) applied beyond document analysis to strategic settings.
Result: The review identifies GNN as particularly promising for modeling multiagent relationships, analyzes how DRL can address non-stationarity challenges, and explores PTM’s potential in strategic domains. It also evaluates existing game theoretic concepts for fairness and stability.
Conclusion: Key open challenges remain: fitting non-stationary environments, balancing stability vs adaptation, tackling uncertainty and heterogeneity, and ensuring scalability. The paper advocates for integrating ML methods with game theory while avoiding unrealistic assumptions common in traditional approaches.
Abstract: This paper provides a comprehensive review of mainly GNN, DRL, and PTM methods with a focus on their potential incorporation in strategic multiagent settings. We draw interest in (i) ML methods currently utilized for uncovering unknown model structures adaptable to the task of strategic opponent modeling, and (ii) the integration of these methods with Game Theoretic concepts that avoid relying on assumptions often invalid in real-world scenarios, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH). We analyze the ability to handle uncertainty and heterogeneity, two characteristics that are very common in real-world application cases, as well as scalability. As a potential answer to effectively modeling relationships and interactions in multiagent settings, we champion the use of GNN. Such approaches are designed to operate upon graph-structured data, and have been shown to be a very powerful tool for performing tasks such as node classification and link prediction. Next, we review the domain of RL, and in particular that of multiagent deep reinforcement learning. Single-agent deep RL has been widely used for decision making in demanding game settings. Its application in multiagent settings though is hindered due to, e.g., varying relationships between agents, and non-stationarity of the environment. We describe existing relevant game theoretic solution concepts, and consider properties such as fairness and stability. Our review comes complete with a note on the literature that utilizes probabilistic topic modeling (PTM) in domains other than that of document analysis and classification. Finally, we identify certain open challenges – specifically, the need to (i) fit non-stationary environments, (ii) balance the degrees of stability and adaptation, (iii) tackle uncertainty and heterogeneity, (iv) guarantee scalability and solution tractability.
[264] SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu
Main category: cs.AI
TL;DR: SWR is a photorealistic urban simulation platform for embodied AI with procedurally generated cities, supporting multi-robot control and communication, featuring two challenging benchmarks for evaluating robot capabilities in realistic urban scenarios.
Details
Motivation: Current foundation models for robotics focus mainly on indoor/household scenarios, lacking evaluation in large-scale, realistic urban environments with dynamic elements like pedestrians and traffic systems.
Method: Built SimWorld-Robotics (SWR) on Unreal Engine 5 with procedurally generated unlimited photorealistic urban scenes, dynamic elements (pedestrians, traffic), multi-robot control/communication, and created two benchmarks: multimodal instruction-following navigation and multi-agent search tasks.
Result: State-of-the-art models (including VLMs) struggle with the tasks, showing deficiencies in robust perception, reasoning, and planning abilities needed for urban environments, demonstrating the platform’s effectiveness in revealing model limitations.
Conclusion: SWR provides a comprehensive simulation platform for evaluating embodied AI in realistic urban scenarios, revealing critical gaps in current models’ capabilities for urban robotics and enabling development of more robust generalist robotics systems.
Abstract: Recent advances in foundation models have shown promising results in developing generalist robotics that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking robust perception, reasoning, and planning abilities necessary for urban environments.
[265] Monadic Context Engineering
Yifan Zhang, Yang Yuan, Mengdi Wang, Andrew Chi-Chih Yao
Main category: cs.AI
TL;DR: MCE introduces a monadic framework for building robust AI agents by treating workflows as computational contexts with algebraic structures managing state, errors, and concurrency.
Details
Motivation: Current LLM-based agent architectures use brittle, ad hoc patterns that struggle with state management, error handling, and concurrency, creating unreliable systems.
Method: Monadic Context Engineering (MCE) uses Functors, Applicative Functors, and Monads to provide formal algebraic foundations for agent design, with Monad Transformers enabling systematic composition of capabilities.
Result: MCE enables construction of complex, resilient, and efficient AI agents from simple, verifiable components, and extends to Meta-Agents for generative orchestration of sub-agent workflows.
Conclusion: MCE offers a principled, formal architectural paradigm that addresses brittleness in current agent systems through algebraic abstractions, enabling robust sequential and parallel execution with systematic composition.
Abstract: The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming.
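Illustration: the sequential-composition idea can be sketched as a small state-plus-error monad, where each agent step maps a state to either a success carrying an output and a new state, or a failure that short-circuits the rest of the pipeline. The names `Ok`, `Err`, `unit`, `bind`, and `call_tool`, and the toy tools, are hypothetical and are not the paper's API; this is a sketch under those assumptions.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Ok:
    value: Any  # a pair (output, new_state)

@dataclass(frozen=True)
class Err:
    reason: str

def unit(value):
    """Lift a plain value into a step that leaves the state unchanged."""
    return lambda state: Ok((value, state))

def bind(step, next_step_for):
    """Run `step`; on success feed its output to `next_step_for`, else stop."""
    def run(state):
        result = step(state)
        if isinstance(result, Err):
            return result  # short-circuit: later steps never execute
        output, new_state = result.value
        return next_step_for(output)(new_state)
    return run

def call_tool(name, arg):
    """Step that calls a named tool and appends the call to the state's log."""
    def run(state):
        tools = state["tools"]
        if name not in tools:
            return Err(f"unknown tool: {name}")
        output = tools[name](arg)
        return Ok((output, {**state, "log": state["log"] + [name]}))
    return run

# Hypothetical tools and initial agent state.
state = {
    "tools": {
        "search": lambda query: [query.upper()],
        "summarize": lambda hits: f"{len(hits)} hit(s)",
    },
    "log": [],
}
pipeline = bind(call_tool("search", "monads"),
                lambda hits: bind(call_tool("summarize", hits), unit))
result = pipeline(state)
broken = bind(call_tool("browse", "x"),
              lambda _: call_tool("summarize", []))(state)
```

State threading and error propagation live entirely inside `bind`, so individual steps stay small and can be checked independently, which is the property the abstract emphasises.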
[266] MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning
Minh Hieu Ha, Khanh Ly Ta, Hung Phan, Tung Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
Main category: cs.AI
TL;DR: MMP-A* integrates vision-language models with adaptive decay to improve path planning by combining spatial grounding with geometric precision, reducing computational costs while maintaining near-optimal trajectories.
Details
Motivation: Classical A* is computationally expensive in large-scale scenarios, while text-only LLM-based approaches lack spatial grounding and produce incorrect waypoints in complex environments with dead ends and ambiguous boundaries.
Method: MMP-A* integrates vision-language models for spatial grounding with a novel adaptive decay mechanism that dynamically regulates uncertain waypoint influence in the heuristic, anchoring high-level reasoning in physical geometry.
Result: Experimental results in challenging cluttered and topologically complex environments show MMP-A* achieves near-optimal trajectories with significantly reduced operational costs and memory overhead.
Conclusion: MMP-A* demonstrates potential as a perception-grounded and computationally efficient paradigm for autonomous navigation by combining multimodal reasoning with geometric validity.
Abstract: Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.
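One possible reading of "regulating the influence of uncertain waypoints within the heuristic" is sketched below on a small grid. The abstract does not give the decay rule, so the exponential schedule, the parameters `w0` and `decay`, and the function name `guided_astar` are assumptions made for illustration. Note that the extra waypoint term makes the heuristic inadmissible, so the result is only near-optimal in general.

```python
import heapq
import math


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def guided_astar(grid, start, goal, waypoint, w0=1.0, decay=0.2):
    """A* on a 4-connected grid (0 = free, 1 = wall) with a decaying waypoint bias.

    Returns (path, expansions); path is None when the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    open_heap = [(manhattan(start, goal), 0, start)]
    g_cost = {start: 0}
    parent = {start: None}
    expansions = 0
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if g > g_cost[node]:
            continue  # stale heap entry, a cheaper route was found later
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], expansions
        expansions += 1
        # Assumed decay rule: the waypoint term fades as expansions accumulate.
        weight = w0 * math.exp(-decay * expansions)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]]:
                continue
            new_g = g + 1
            if new_g < g_cost.get(nxt, math.inf):
                g_cost[nxt] = new_g
                parent[nxt] = node
                h = manhattan(nxt, goal) + weight * manhattan(nxt, waypoint)
                heapq.heappush(open_heap, (new_g + h, new_g, nxt))
    return None, expansions


grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path, expansions = guided_astar(grid, start=(0, 0), goal=(2, 0), waypoint=(1, 2))
blocked, _ = guided_astar([[0, 1, 0]], start=(0, 0), goal=(0, 2), waypoint=(0, 1))
```

Here the waypoint marks the only gap in the wall, so the bias pulls early expansions toward it and then fades, leaving the goal distance to dominate.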
[267] GDEPO: Group Dual-dynamic and Equal-right Advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning
Zhengqing Yan, Xinyang Liu, Yi Zhang, Fan Guo, ChengXun Jia, Junchen Wan, Yao Liu, Qi Liu, Jihao Huang, Kang Song
Main category: cs.AI
TL;DR: GDEPO improves RL-based automated theorem proving by addressing GRPO’s issues with composite rewards and static sampling through dynamic resampling, decoupled advantage estimation, and extra gradient steps.
Details
Motivation: Current RL approaches like GRPO have two critical issues in ATP: 1) relative advantage estimation conflicts with binary verifier feedback when using composite rewards, and 2) static sampling wastes entire batches when no valid proofs are found, resulting in zero model updates.
Method: Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO) with three core mechanisms: 1) dynamic additional sampling (resamples invalid batches until valid proof found), 2) equal-right advantage (decouples advantage sign from magnitude for stable updates), and 3) dynamic additional iterations (extra gradient steps for initially failed but eventually successful samples).
Result: Experiments on three datasets (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm GDEPO’s effectiveness, with ablation studies validating the necessity of its synergistic components for enhanced data utilization and optimization efficiency.
Conclusion: GDEPO offers a novel training paradigm for ATP that addresses critical limitations of existing RL approaches, improving data utilization and optimization efficiency through synergistic dynamic mechanisms and proper advantage handling.
Abstract: Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, resulting in zero contribution to model updates and significant data waste. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.
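The conflict between group-relative advantages and binary verifier feedback can be seen in a small numeric sketch. The group-normalised advantage below follows the usual GRPO form (reward minus group mean, divided by group standard deviation). The `equal_right_advantages` function is only an assumed reading of "sign from correctness, magnitude from auxiliary rewards": the min-max scaling and the 0.5 to 1.0 magnitude range are choices made for this example, not taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardise rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def equal_right_advantages(correct, aux, eps=1e-8):
    """Sign comes from the verifier; magnitude comes from auxiliary rewards.

    The scaling used here is an assumption made for illustration.
    """
    lo = min(aux)
    span = max(aux) - lo
    out = []
    for ok, a in zip(correct, aux):
        scaled = (a - lo) / span if span > eps else 0.5
        magnitude = 0.5 + 0.5 * scaled  # kept away from zero so every sample counts
        out.append(magnitude if ok else -magnitude)
    return out


# Hypothetical group of four sampled proofs: only the first one verifies.
correct = [True, False, False, False]
aux = [0.1, 0.9, 0.2, 0.0]  # e.g. a style or length bonus
composite = [float(ok) + a for ok, a in zip(correct, aux)]

grpo = grpo_advantages(composite)
era = equal_right_advantages(correct, aux)
```

With these numbers the second sample fails verification yet receives a positive group-relative advantage because its composite reward sits above the group mean, whereas the decoupled version keeps every failed sample negative.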
[268] Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation
Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Yunxiang Zhang, Moontae Lee, Hao Peng, Lu Wang, Honglak Lee
Main category: cs.AI
TL;DR: LLM judges evaluating agent performance are highly vulnerable to manipulation of agent reasoning traces, with manipulated reasoning inflating false positive rates by up to 90% across diverse web tasks.
Details
Motivation: Current LLM-based evaluation of agents assumes that chain-of-thought reasoning faithfully reflects internal reasoning and environment state, but this assumption needs testing for robustness against manipulation.
Method: Systematically rewrote agent chain-of-thought reasoning while keeping actions and observations fixed, testing both style-based (presentation changes) and content-based (fabricating task progress signals) manipulation strategies across 800 trajectories in diverse web tasks.
Result: Manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90%. Content-based manipulations are consistently more effective than style-based approaches. Prompting techniques and scaling compute reduce but don’t eliminate susceptibility.
Conclusion: LLM-based evaluation has fundamental vulnerability to reasoning trace manipulation, highlighting the need for judging mechanisms that verify reasoning claims against observable evidence rather than trusting reasoning traces at face value.
Abstract: Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent’s CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.
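For readers unfamiliar with the metric, the false positive rate here is the share of genuinely failed trajectories that the judge marks as successful. The sketch below computes it before and after a reasoning rewrite; the verdict lists are made-up placeholders, not data from the paper.

```python
def false_positive_rate(judge_says_success, truly_successful):
    """Fraction of truly failed trajectories that the judge labels as successes."""
    verdicts_on_failures = [
        verdict
        for verdict, truth in zip(judge_says_success, truly_successful)
        if not truth
    ]
    if not verdicts_on_failures:
        raise ValueError("no failed trajectories to evaluate")
    return sum(verdicts_on_failures) / len(verdicts_on_failures)


# Hypothetical example: four failed trajectories and one successful one.
truth = [False, False, False, False, True]
original_verdicts = [False, False, False, True, True]
manipulated_verdicts = [True, True, True, True, True]  # same actions, rewritten CoT

fpr_before = false_positive_rate(original_verdicts, truth)
fpr_after = false_positive_rate(manipulated_verdicts, truth)
```

Because actions and observations are held fixed, any gap between the two rates is attributable to the rewritten reasoning alone.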
[269] Emerging from Ground: Addressing Intent Deviation in Tool-Using Agents via Deriving Real Calls into Virtual Trajectories
Qian Xiong, Yuekai Huang, Bo Yang, Yujia Zheng, Tianhao Li, Ziyou Jiang, Zhiyuan Chang, Zhaoyang Li, Huanxiang Feng, Mingyang Li
Main category: cs.AI
TL;DR: RISE: A “Real-to-Virtual” method that synthesizes virtual tool-using trajectories anchored on verified tool primitives to mitigate intent deviation in LLM agents, achieving significant improvements in task completion and intent alignment.
Details
Motivation: LLM tool-using agents often exhibit intent deviation (subtle misalignment between user intent and agent behavior), which hinders reliable evaluation and improvement. Existing methods are either costly (real system samples) or suffer from distribution shift (LLM-simulated data), and both lack negative samples for intent deviation scenarios.
Method: RISE uses a “Real-to-Virtual” approach that anchors on verified tool primitives to synthesize virtual trajectories. It generates diverse negative samples through mutation of critical parameters, then fine-tunes backbone LLMs via two-stage training for intent alignment using the synthetic data.
Result: RISE achieves 35.28% average improvement in task completion (Acctask) and 23.27% in intent alignment (Accintent), outperforming SOTA baselines by 1.20-42.09% and 1.17-54.93% respectively. The synthesized data performs well across eight metrics covering user requirements, execution trajectories, and agent responses.
Conclusion: RISE effectively addresses intent deviation in LLM tool-using agents by generating high-quality synthetic data anchored to real tool primitives, enabling significant performance improvements through targeted preference learning with diverse negative samples.
Abstract: LLMs have advanced tool-using agents for real-world applications, yet they often lead to unexpected behaviors or results. Beyond obvious failures, the subtle issue of “intent deviation” severely hinders reliable evaluation and performance improvement. Existing post-training methods generally leverage either real system samples or virtual data simulated by LLMs. However, the former is costly due to reliance on hand-crafted user requests, while the latter suffers from distribution shift from the real tools in the wild. Additionally, both methods lack negative samples tailored to intent deviation scenarios, hindering effective guidance on preference learning. We introduce RISE, a “Real-to-Virtual” method designed to mitigate intent deviation. Anchoring on verified tool primitives, RISE synthesizes virtual trajectories and generates diverse negative samples through mutation of critical parameters. With synthetic data, RISE fine-tunes backbone LLMs via two-stage training for intent alignment. Evaluation results demonstrate that data synthesized by RISE achieve promising results in eight metrics covering user requirements, execution trajectories and agent responses. Integrating with training, RISE achieves an average 35.28% improvement in Acctask (task completion) and 23.27% in Accintent (intent alignment), outperforming SOTA baselines by 1.20–42.09% and 1.17–54.93% respectively.
[270] BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen
Main category: cs.AI
TL;DR: BayesianVLA addresses information collapse in VLA models by enforcing instruction following through Bayesian decomposition and maximizing conditional pointwise mutual information between actions and instructions.
Details
Motivation: Current VLA models struggle with generalization to new instructions and multi-task scenarios due to dataset bias where language instructions become predictable from visual observations alone, causing information collapse where models ignore language constraints.
Method: Proposes BayesianVLA framework with learnable Latent Action Queries and dual-branch architecture to estimate both vision-only prior p(a|v) and language-conditioned posterior π(a|v,ℓ), optimizing policy to maximize conditional PMI between actions and instructions.
Result: Significantly improves generalization without requiring new data, achieving 11.3% improvement on challenging OOD SimplerEnv benchmark, with extensive experiments across SimplerEnv and RoboCasa demonstrating substantial gains.
Conclusion: BayesianVLA effectively addresses information collapse in VLA models by penalizing vision shortcuts and rewarding actions that explain language commands, enabling robust language grounding in action for better generalization.
Abstract: Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments across SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
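Using the two distributions named in the abstract, the conditional PMI between an action $a$ and an instruction $\ell$ given the observation $v$ takes the standard form below. The second line is a textbook Bayes-rule identity, written under the assumption that both terms are conditionals of one joint distribution; it is included only to show why a high PMI corresponds to an action that "explains" the instruction, and the paper's exact training objective may differ.

```latex
\begin{align*}
\mathrm{PMI}(a;\ell \mid v)
  &= \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)}
   = \log \pi(a \mid v, \ell) - \log p(a \mid v) \\
  &= \log \frac{p(\ell \mid a, v)}{p(\ell \mid v)}
\end{align*}
```

When the instruction is already predictable from vision alone, the posterior and prior nearly coincide and the PMI is close to zero for every action, which matches the Information Collapse described above.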
cs.SD
[271] Abusive music and song transformation using GenAI and LLMs
Jiyang Choi, Rohitash Chandra
Main category: cs.SD
TL;DR: GenAI transforms abusive content in music by altering vocal delivery and lyrics, reducing aggression while maintaining artistic integrity.
Details
Motivation: Repeated exposure to violent/abusive music content can normalize aggression and reinforce harmful stereotypes, creating a need for content moderation that doesn't trigger the "forbidden fruit" effect.
Method: Use GenAI and LLMs to automatically transform abusive words and lyrical content in popular music, altering tone, intensity, and sentiment rather than just muting/replacing single words. Comparative analysis of 4 English songs and their transformed versions using acoustic and sentiment analysis.
Result: GenAI significantly reduces vocal aggressiveness: acoustic analysis shows improvements in Harmonic to Noise Ratio, Cepstral Peak Prominence, and Shimmer. Sentiment analysis shows 63.3-85.6% aggression reduction across artists, with up to 88.6% reduction in chorus sections. Transformed versions maintain musical coherence.
Conclusion: GenAI offers a promising alternative to traditional content moderation by creating safer listening experiences while preserving artistic expression and avoiding the “forbidden fruit” effect.
Abstract: Repeated exposure to violence and abusive content in music and songs can influence listeners’ emotions and behaviours, potentially normalising aggression or reinforcing harmful stereotypes. In this study, we explore the use of generative artificial intelligence (GenAI) and Large Language Models (LLMs) to automatically transform abusive words (vocal delivery) and lyrical content in popular music. Rather than simply muting or replacing a single word, our approach transforms the tone, intensity, and sentiment, thus altering not just the lyrics but also how they are expressed. We present a comparative analysis of four selected English songs and their transformed counterparts, evaluating changes through both acoustic and sentiment-based lenses. Our findings indicate that GenAI significantly reduces vocal aggressiveness, with acoustic analysis showing improvements in Harmonic to Noise Ratio, Cepstral Peak Prominence, and Shimmer. Sentiment analysis reduced aggression by 63.3-85.6% across artists, with major improvements in chorus sections (up to 88.6% reduction). The transformed versions maintained musical coherence while mitigating harmful content, offering a promising alternative to traditional content moderation that avoids triggering the “forbidden fruit” effect, where the censored content becomes more appealing simply because it is restricted. This approach demonstrates the potential for GenAI to create safer listening experiences while preserving artistic expression.
[272] DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice
Leying Zhang, Tingxiao Zhou, Haiyang Sun, Mengxiao Bi, Yanmin Qian
Main category: cs.SD
TL;DR: DeepASMR is the first zero-shot ASMR generation framework that can synthesize high-fidelity ASMR speech from just a short snippet of ordinary read-style speech, without needing whispered training data from the target speaker.
Details
Motivation: Current TTS systems excel at read-style speech but fail at generating ASMR (Autonomous Sensory Meridian Response) speech, which requires subtle, low-intensity characteristics and zero-shot speaker adaptation. ASMR's unique qualities make it challenging for conventional TTS approaches.
Method: The framework uses discrete speech tokens that factorize ASMR style from speaker timbre. It employs a two-stage pipeline: 1) LLM for content-style encoding, and 2) flow-matching acoustic decoder for timbre reconstruction. The approach enables zero-shot ASMR generation from minimal speaker data.
Result: DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for any voice, while maintaining competitive performance on normal speech synthesis. The authors also contribute DeepASMR-DB, a 670-hour English-Chinese multi-speaker ASMR corpus, and a novel evaluation protocol.
Conclusion: DeepASMR successfully addresses the challenge of zero-shot ASMR generation by separating style from timbre, enabling high-fidelity ASMR synthesis from minimal speaker data without requiring whispered training samples, representing a significant advancement in specialized speech synthesis.
Abstract: While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR’s subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker’s ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.
[273] Qwen3-TTS Technical Report
Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
Main category: cs.SD
TL;DR: Qwen3-TTS is a multilingual text-to-speech model family featuring 3-second voice cloning, description-based control, dual-track architecture for real-time synthesis, and two specialized tokenizers for different streaming needs.
Details
Motivation: To create advanced multilingual TTS models that combine voice cloning, fine-grained control, robustness, and efficient streaming capabilities while supporting community research through open licensing.
Method: Uses dual-track LM architecture with two specialized tokenizers: 1) Qwen-TTS-Tokenizer-25Hz (single-codebook for semantic content, integrates with Qwen-Audio, uses block-wise DiT for streaming), and 2) Qwen-TTS-Tokenizer-12Hz (12.5Hz multi-codebook for extreme bitrate reduction and ultra-low-latency with 97ms first-packet emission via lightweight causal ConvNet). Trained on 5M+ hours of 10-language speech data.
Result: State-of-the-art performance on diverse benchmarks including TTS multilingual test set, InstructTTSEval, and long speech test set. Supports 3-second voice cloning and description-based control for novel voice creation and fine-grained manipulation.
Conclusion: Qwen3-TTS series delivers advanced multilingual, controllable, robust, and streaming TTS capabilities with open-source availability under Apache 2.0 license, enabling both novel voice creation and community research development.
Abstract: In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
[274] PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation
Jaekwon Im, Natalia Polouliakh, Taketo Akama
Main category: cs.SD
TL;DR: PF-D2M is a universal diffusion-based dance-to-music generation model that uses visual features from dance videos and progressive training to overcome data scarcity and generalization issues.
Details
Motivation: Existing dance-to-music generation approaches have limitations: they rely on single human dancer motion features and limited datasets, restricting performance and applicability to real-world scenarios with multiple dancers and non-human dancers.
Method: PF-D2M is a universal diffusion-based model that incorporates visual features extracted from dance videos. It uses a progressive training strategy to address data scarcity and generalization challenges.
Result: Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.
Conclusion: PF-D2M successfully addresses limitations of existing approaches by using visual features from dance videos and progressive training, achieving superior performance in generating music aligned with dance movements.
Abstract: Dance-to-music generation aims to generate music that is aligned with dance movements. Existing approaches typically rely on body motion features extracted from a single human dancer and limited dance-to-music datasets, which restrict their performance and applicability to real-world scenarios involving multiple dancers and non-human dancers. In this paper, we propose PF-D2M, a universal diffusion-based dance-to-music generation model that incorporates visual features extracted from dance videos. PF-D2M is trained with a progressive training strategy that effectively addresses data scarcity and generalization challenges. Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.
[275] EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
Dingdong Wang, Shujie Liu, Tianhua Zhang, Youjun Chen, Jinyu Li, Helen Meng
Main category: cs.SD
TL;DR: EmotionThinker reformulates speech emotion recognition as a deep reasoning problem using RL, generating interpretable explanations grounded in acoustic cues, and outperforms previous SOTA models in both accuracy and explanation quality.
Details
Motivation: Current SpeechLLMs and SER systems treat emotion understanding as simple classification, providing limited interpretability and underutilizing LLMs' expressive and reasoning capabilities. There's a need to move beyond classification to deep reasoning with interpretable explanations.
Method: 1) Construct EmotionCoT-35K dataset with Chain-of-Thought annotations and detailed captions; 2) Develop prosody-enhanced foundation model EmotionThinker-Base to address weak prosody perception in current SpeechLLMs; 3) Introduce GRPO-PTR (Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward) RL framework that progressively introduces reasoning rewards with dynamic trustworthiness weighting.
Result: EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality. Prosody enhancement improves emotion understanding, and the RL approach advances SER toward interpretable multimodal reasoning.
Conclusion: The work successfully reformulates SER as a deep reasoning problem through RL, creating a system that generates accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues, advancing the field toward interpretable multimodal reasoning.
Abstract: Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs’ expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates the overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: https://github.com/dingdongwang/EmotionThinker
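The reward shaping described for GRPO-PTR can be pictured with a minimal sketch. The abstract does not specify the schedule or how the trust weight is computed, so the linear warm-up, the `warmup_steps` parameter, and the function name `ptr_reward` are assumptions; the trust weight and reasoning score are taken as given inputs.

```python
def ptr_reward(outcome, reasoning, trust, step, warmup_steps=1000):
    """Combine a rule-based outcome reward with a gated reasoning reward.

    outcome   : rule-based outcome reward (e.g. 1.0 if the emotion label is right)
    reasoning : reasoning-quality score from a reward model
    trust     : weight in [0, 1] for how well reasoning and outcome agree
    step      : current training step, used for the assumed linear warm-up
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    ramp = min(1.0, step / warmup_steps)  # progressive introduction of the term
    return outcome + ramp * trust * reasoning


early = ptr_reward(outcome=1.0, reasoning=0.8, trust=1.0, step=0)
late = ptr_reward(outcome=1.0, reasoning=0.8, trust=0.5, step=2000)
```

At step 0 only the outcome reward counts; once the warm-up has finished, the reasoning score contributes in proportion to the trust weight.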
[276] Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems
Hengfan Zhang, Yueqian Lin, Hai Helen Li, Yiran Chen
Main category: cs.SD
TL;DR: CoFi-Agent is a hybrid edge-cloud architecture that uses local 7B Audio-LLM for fast perception and conditional cloud-triggered forensic refinement with on-device tools, improving accuracy from 27.2% to 53.6% on MMAR benchmark.
Details
Motivation: There's a tension between perception depth and computational efficiency in deploying Audio-LLMs on edge infrastructure. Lightweight local models produce generic summaries missing subtle evidence for multi-step audio reasoning, while cloud offloading causes unacceptable latency, bandwidth costs, and privacy risks.
Method: CoFi-Agent performs fast local perception using a 7B Audio-LLM, then a cloud controller gates difficult cases based on uncertainty detection. It issues lightweight plans for on-device tools like temporal re-listening and local ASR for forensic refinement.
Result: On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline.
Conclusion: CoFi-Agent bridges the perception gap through tool-enabled, conditional edge-cloud collaboration under practical system constraints, offering an effective solution for audio-language model deployment on edge infrastructure.
Abstract: Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception - generic summaries that miss the subtle evidence required for multi-step audio reasoning - while indiscriminate cloud offloading incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways. It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent runs an initial single-pass on a local 7B Audio-LLM, then a cloud controller gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline. Overall, CoFi-Agent bridges the perception gap via tool-enabled, conditional edge-cloud collaboration under practical system constraints.
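The gating step can be sketched as a simple uncertainty threshold on the local model's answer distribution. The abstract does not say which uncertainty signal CoFi-Agent uses, so normalised entropy, the 0.5 threshold, and the names `coarse_to_fine` and `refine` are assumptions for illustration; `refine` stands in for the controller-planned tool calls such as re-listening or local ASR.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability options."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def coarse_to_fine(answer_probs, refine, threshold=0.5):
    """Return (answer_index, escalated).

    answer_probs : local model's probabilities over the answer options
    refine       : callable invoked only for uncertain cases (hypothetical hook)
    threshold    : normalised-entropy cut-off, an assumed tuning knob
    """
    if len(answer_probs) < 2:
        raise ValueError("need at least two answer options")
    best = max(range(len(answer_probs)), key=answer_probs.__getitem__)
    normalised = entropy(answer_probs) / math.log(len(answer_probs))
    if normalised <= threshold:
        return best, False  # confident: keep the fast local answer
    return refine(best), True  # uncertain: trigger the finer investigation


confident = coarse_to_fine([0.90, 0.05, 0.03, 0.02], refine=lambda first_guess: 2)
uncertain = coarse_to_fine([0.30, 0.30, 0.20, 0.20], refine=lambda first_guess: 2)
```

Only the second call pays for refinement, which is the source of the accuracy-efficiency trade-off relative to an always-on investigation pipeline.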
[277] U3-xi: Pushing the Boundaries of Speaker Recognition via Incorporating Uncertainty
Junjie Li, Kong Aik Lee
Main category: cs.SD
TL;DR: U3-xi framework improves speaker verification by estimating frame-level uncertainty to weight frame contributions, achieving significant performance gains.
Details
Motivation: In real-world scenarios, individual frames in speaker embeddings contain both speaker-relevant information and nuisance factors, causing unequal contributions to final speaker representations. Current methods don't adequately address this frame-level uncertainty.
Method: Proposes U3-xi framework with three uncertainty supervision strategies: 1) Speaker-level uncertainty via Stochastic Variance Loss using distance to speaker centroid as pseudo ground truth; 2) Global-level uncertainty by injecting predicted uncertainty into softmax scale for adaptive decision boundary; 3) Transformer encoder with multi-view self-attention for uncertainty estimation.
Result: Model-agnostic framework achieves 21.1% relative improvement in EER and 15.57% in minDCF on VoxCeleb1 test sets when applied to ECAPA-TDNN speaker encoder.
Conclusion: U3-xi effectively addresses frame-level uncertainty in speaker embeddings, producing more reliable and interpretable uncertainty estimates while significantly improving speaker verification performance across different encoder architectures.
Abstract: An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. However, in real-world scenarios, individual frames encode not only speaker-relevant information but also various nuisance factors. As a result, different frames contribute unequally to the final utterance-level speaker representation for Automatic Speaker Verification systems. To address this issue, we propose to estimate the inherent uncertainty of each frame and assign adaptive weights accordingly, where frames with higher uncertainty receive lower attention. Based on this idea, we present U3-xi, a comprehensive framework designed to produce more reliable and interpretable uncertainty estimates for speaker embeddings. Specifically, we introduce several strategies for uncertainty supervision. First, we propose speaker-level uncertainty supervision via a Stochastic Variance Loss, where the distance between an utterance embedding and its corresponding speaker centroid serves as a pseudo ground truth for uncertainty learning. Second, we incorporate global-level uncertainty supervision by injecting the predicted uncertainty into the softmax scale during training. This adaptive scaling mechanism adjusts the sharpness of the decision boundary according to sample difficulty, providing global guidance. Third, we redesign the uncertainty estimation module by integrating a Transformer encoder with multi-view self-attention, enabling the model to capture rich local and long-range temporal dependencies. Comprehensive experiments demonstrate that U3-xi is model-agnostic and can be seamlessly applied to various speaker encoders. In particular, when applied to ECAPA-TDNN, it achieves 21.1% and 15.57% relative improvements on the VoxCeleb1 test sets in terms of EER and minDCF, respectively.
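Illustrative sketch: the core weighting idea in this abstract (frames with higher uncertainty receive lower attention) can be pictured as uncertainty-weighted pooling. The snippet below assumes each frame comes with a positive predicted variance and uses inverse-variance weights. That weighting rule and the name `uncertainty_weighted_pool` are assumptions for illustration, not the paper's formulation.

```python
from typing import List, Sequence


def uncertainty_weighted_pool(frames: Sequence[Sequence[float]],
                              variances: Sequence[float]) -> List[float]:
    """Pool frame vectors into one utterance vector, down-weighting uncertain frames."""
    if not frames or len(frames) != len(variances):
        raise ValueError("need one variance per frame")
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be positive")
    # Assumed rule: each frame's weight is proportional to 1 / variance.
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
variances = [0.5, 2.0, 1.0]  # the second frame is the least reliable
pooled = uncertainty_weighted_pool(frames, variances)
```

In this toy input the low-variance first frame dominates the pooled vector, while the high-variance second frame contributes least.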
[278] Distillation-based Layer Dropping (DLD) Effective End-to-end Framework for Dynamic Speech Networks
Abdul Hannan, Daniele Falavigna, Shah Nawaz, Mubashir Noman, Markus Schedl, Alessio Brutti
Main category: cs.SD
TL;DR: DLD framework combines knowledge distillation with layer dropping to create dynamic speech networks that maintain performance across different computational budgets while reducing training time.
Details
Motivation: Edge devices need dynamic architectures that can adapt to varying resource constraints. Existing layer dropping methods degrade performance significantly in both low and high dropping scenarios, compromising the performance-computation trade-off.
Method: Proposes distillation-based layer dropping (DLD) framework that integrates knowledge distillation with layer dropping in an end-to-end fashion, enabling dynamic speech networks to maintain performance across different computational budgets.
Result: Achieves state-of-the-art performance for dynamic speech networks, reducing word error rate by 9.32% for high dropping cases and 2.25% for no dropping cases, with 33.3% reduction in training time. Validated on conformer and WavLM models across three public benchmarks.
Conclusion: The DLD framework effectively addresses the limitations of existing layer dropping methods by combining knowledge distillation, achieving better performance-computation trade-offs for dynamic speech networks on edge devices.
Abstract: Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, thereby reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly impact the dynamic model’s performance for low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for high and no dropping cases, respectively, with a $33.3\%$ reduction in training time.
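Illustrative sketch: one way to picture combining layer dropping with distillation is to run the same residual stack twice, once with all layers (teacher) and once with some layers skipped (student), and penalize the gap between the two outputs. The toy layers, the mean-squared-error loss, and all function names below are assumptions for illustration. The summary does not specify the actual DLD objective.

```python
import random
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def forward(x: Sequence[float], layers: Sequence[Layer],
            keep_mask: Sequence[bool]) -> List[float]:
    """Run a residual stack, skipping every layer whose mask entry is False."""
    h = list(x)
    for layer, keep in zip(layers, keep_mask):
        if keep:
            h = [a + b for a, b in zip(h, layer(h))]  # residual connection
    return h


def sample_keep_mask(n_layers: int, drop_prob: float,
                     rng: random.Random) -> List[bool]:
    """Independently drop each layer with probability drop_prob."""
    return [rng.random() >= drop_prob for _ in range(n_layers)]


def distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between student and teacher outputs (assumed loss)."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


# Three toy layers that each add 10% of their input through the residual path.
layers: List[Layer] = [lambda h: [0.1 * v for v in h]] * 3
x = [1.0, 2.0]

teacher_out = forward(x, layers, [True, True, True])   # full network
student_out = forward(x, layers, [True, False, True])  # middle layer dropped
loss = distillation_loss(student_out, teacher_out)

random_mask = sample_keep_mask(len(layers), drop_prob=0.5, rng=random.Random(0))
```

In a real training loop the mask would be resampled per batch and the distillation term combined with the task loss; both details are left out here.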
[279] Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization
Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros, Emilios Cambouropoulos
Main category: cs.SD
TL;DR: The paper introduces a new training curriculum called FF (full-to-full) for melodic harmonization that improves melody-harmony attention by keeping all harmony tokens masked initially, then progressively unmasking entire sequences during training.
Details
Motivation: Existing single encoder transformer approaches for melodic harmonization suffer from weak attention between melody and harmony due to training curricula inspired by discrete diffusion, leading to limited exploitation of melodic cues, especially in out-of-domain contexts.
Method: Proposes the FF curriculum, which keeps all harmony tokens masked for several training steps, then progressively unmasks entire sequences during training to strengthen melody-harmony interactions. Evaluated across multiple experimental axes including temporal quantization, conditioning methods, melody representations, and inference strategies.
Result: FF curriculum consistently outperforms baselines in nearly all metrics, with strong gains in out-of-domain evaluations. Quarter-note quantization, intertwining bar tokens, and pitch-class melody representations work best in FF setting.
Conclusion: Training curricula are crucial for effective melody conditioning in harmonization, and full-to-full unmasking offers a robust strategy for single encoder harmonization models.
Abstract: Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.
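Illustrative sketch: the curriculum described above holds all harmony tokens masked for a number of steps and then unmasks progressively. The exact schedule is not given in the summary, so the snippet below assumes a hold phase followed by a linear decay of the masked fraction. The schedule shape, the `<mask>` token, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def masked_fraction(step: int, hold_steps: int, total_steps: int) -> float:
    """Fraction of harmony tokens to mask at a given training step.

    Assumed schedule: fully masked during the hold phase, then a linear
    decay to zero by total_steps.
    """
    if step < hold_steps:
        return 1.0
    progress = (step - hold_steps) / max(1, total_steps - hold_steps)
    return max(0.0, 1.0 - progress)


def mask_harmony(tokens: Sequence[str], fraction: float, rng: random.Random,
                 mask_token: str = "<mask>") -> List[str]:
    """Replace a random subset of harmony tokens with the mask token."""
    n_mask = round(fraction * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else t for i, t in enumerate(tokens)]


rng = random.Random(0)
harmony = ["C", "Am", "F", "G"]

early = mask_harmony(harmony, masked_fraction(0, 100, 1100), rng)
middle = mask_harmony(harmony, masked_fraction(600, 100, 1100), rng)
late = mask_harmony(harmony, masked_fraction(1100, 100, 1100), rng)
```

With these hypothetical settings the model sees only the melody during the first 100 steps, which is the phase intended to force it to attend to melodic context.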
[280] Domain-Incremental Continual Learning for Robust and Efficient Keyword Spotting in Resource Constrained Systems
Prakash Dhungana, Sayed Ahmad Salehi
Main category: cs.SD
TL;DR: A continual learning framework for keyword spotting that adapts to noisy environments while maintaining efficiency on edge devices, achieving over 94% accuracy even at -10 dB SNR.
Details
Motivation: Small footprint KWS models on edge devices struggle with accuracy and robustness due to domain shifts from varying noise and recording conditions. They need to adapt to new domains while maintaining computational efficiency.
Method: Comprehensive continual learning framework with: 1) Dual-input CNN using MFCC and Mel-spectrogram features, 2) Multi-stage denoising (wavelet transform + spectral subtraction), 3) Complete quantized model updates (not just specific layers), 4) Runtime sample selection using class prototypes and confidence filtering, 5) Pseudo-labeling combined with rehearsal buffer for incremental retraining.
Result: Achieved 99.63% accuracy on clean data and maintained robust performance exceeding 94% accuracy across diverse noisy environments, even at challenging -10 dB Signal-to-Noise Ratio.
Conclusion: Integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments, addressing domain shift challenges effectively.
Abstract: Keyword Spotting (KWS) systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process, involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, which is made possible by the compact model architecture. A subset of input samples is selected at runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework’s effectiveness, achieving 99.63% accuracy on clean data and maintaining robust performance (exceeding 94% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
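Illustrative sketch: the runtime selection step (class prototypes plus confidence-driven filtering) can be pictured as a two-part check before a sample is pseudo-labeled. The snippet below assumes a sample is kept only if the classifier is confident and its nearest class prototype, by Euclidean distance, agrees with the predicted label. The agreement rule, the threshold, and the function names are assumptions, not the paper's exact criteria.

```python
import math
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[List[float], str, float]  # (feature vector, predicted label, confidence)


def nearest_prototype(feature: Sequence[float],
                      prototypes: Dict[str, Sequence[float]]) -> str:
    """Label of the class prototype closest to the feature vector."""
    return min(prototypes, key=lambda label: math.dist(feature, prototypes[label]))


def select_for_retraining(samples: Sequence[Sample],
                          prototypes: Dict[str, Sequence[float]],
                          min_confidence: float = 0.9) -> List[Tuple[List[float], str]]:
    """Keep confident samples whose prediction agrees with the nearest prototype."""
    selected = []
    for feature, predicted, confidence in samples:
        if confidence < min_confidence:
            continue  # drop low-confidence predictions
        if nearest_prototype(feature, prototypes) != predicted:
            continue  # drop predictions that contradict the prototypes
        selected.append((feature, predicted))  # the prediction becomes the pseudo-label
    return selected


prototypes = {"yes": [0.0, 0.0], "no": [1.0, 1.0]}
samples: List[Sample] = [
    ([0.1, 0.0], "yes", 0.95),  # confident and consistent: kept
    ([0.9, 1.0], "yes", 0.97),  # confident but closer to "no": dropped
    ([0.2, 0.1], "yes", 0.50),  # consistent but not confident: dropped
]
selected = select_for_retraining(samples, prototypes)
```

The selected pairs would then be mixed with rehearsal-buffer samples for the incremental retraining step described in the abstract.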
[281] Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention
HyeYoung Lee, Muhammad Nadeem
Main category: cs.SD
TL;DR: A novel 1D-CNN-based Speech Emotion Recognition framework using MFCC features with channel/spatial attention mechanisms achieves state-of-the-art accuracy across multiple datasets.
Details
Motivation: Traditional SER methods struggle to capture subtle emotional variations and generalize across diverse datasets. The authors aim to bridge computational emotion processing with human auditory perception using MFCC features and improve robustness through data augmentation.
Method: Proposes a 1D-CNN-based SER framework with MFCC spectral features, data augmentation techniques, and enhanced attention mechanisms (channel and spatial attention) to highlight key emotional patterns in speech signals.
Result: Achieves cutting-edge performance: 97.49% (SAVEE), 99.23% (RAVDESS), 89.31% (CREMA-D), 99.82% (TESS), 99.53% (EMO-DB), and 96.39% (EMOVO). Sets new benchmarks in SER with high precision across diverse datasets.
Conclusion: Integration of advanced deep learning methods with attention mechanisms substantially enhances generalization across diverse datasets, demonstrating strong potential for real-world deployment in assistive technologies and human-computer interaction.
Abstract: Speech Emotion Recognition (SER) traditionally relies on auditory data analysis for emotion classification. Several studies have adopted different methods for SER. However, existing SER methods often struggle to capture subtle emotional variations and generalize across diverse datasets. In this article, we use Mel-Frequency Cepstral Coefficients (MFCCs) as spectral features to bridge the gap between computational emotion processing and human auditory perception. To further improve robustness and feature diversity, we propose a novel 1D-CNN-based SER framework that integrates data augmentation techniques. MFCC features extracted from the augmented data are processed using a 1D Convolutional Neural Network (CNN) architecture enhanced with channel and spatial attention mechanisms. These attention modules allow the model to highlight key emotional patterns, enhancing its ability to capture subtle variations in speech signals. The proposed method delivers cutting-edge performance, achieving the accuracy of 97.49% for SAVEE, 99.23% for RAVDESS, 89.31% for CREMA-D, 99.82% for TESS, 99.53% for EMO-DB, and 96.39% for EMOVO. Experimental results show new benchmarks in SER, demonstrating the effectiveness of our approach in recognizing emotional expressions with high precision. Our evaluation demonstrates that the integration of advanced Deep Learning (DL) methods substantially enhances generalization across diverse datasets, underscoring their potential to advance SER for real-world deployment in assistive technologies and human-computer interaction.
[282] Xi+: Uncertainty Supervision for Robust Speaker Embedding
Junjie Li, Kong Aik Lee, Duc-Tuan Truong, Tianchi Liu, Man-Wai Mak
Main category: cs.SD
TL;DR: xi+ improves speaker recognition by adding temporal attention and explicit uncertainty supervision, achieving ~10-11% performance gains over xi-vector.
Details
Motivation: Current xi-vector models have limitations: uncertainty estimation is implicitly trained through classification loss alone without considering temporal relationships between frames, leading to suboptimal supervision for frame importance weighting.
Method: Propose xi+ architecture with two key improvements: 1) Temporal attention module to capture frame-level uncertainty in context-aware manner, 2) Novel Stochastic Variance Loss to explicitly supervise uncertainty learning.
Result: Consistent performance improvements of about 10% on VoxCeleb1-O set and 11% on NIST SRE 2024 evaluation set compared to baseline xi-vector.
Conclusion: Explicit uncertainty supervision and temporal context modeling significantly improve speaker recognition performance by better estimating frame importance for utterance-level representations.
Abstract: There are various factors that can influence the performance of speaker recognition systems, such as emotion, language and other speaker-related or context-related variations. Since individual speech frames do not contribute equally to the utterance-level representation, it is essential to estimate the importance or reliability of each frame. The xi-vector model addresses this by assigning different weights to frames based on uncertainty estimation. However, its uncertainty estimation model is implicitly trained through classification loss alone and does not consider the temporal relationships between frames, which may lead to suboptimal supervision. In this paper, we propose an improved architecture, xi+. Compared to xi-vector, xi+ incorporates a temporal attention module to capture frame-level uncertainty in a context-aware manner. In addition, we introduce a novel loss function, Stochastic Variance Loss, which explicitly supervises the learning of uncertainty. Results demonstrate consistent performance improvements of about 10% on the VoxCeleb1-O set and 11% on the NIST SRE 2024 evaluation set.
[283] Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data
Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid
Main category: cs.SD
TL;DR: Falcon3-Audio is a family of efficient audio-language models that achieve state-of-the-art performance using minimal data and single-stage training, challenging conventional complex architectures.
Details
Motivation: Although LLMs have transformed NLP, their integration with audio remains underexplored despite audio's importance in human communication. There's a need for efficient audio-language models that don't require massive datasets or complex architectures.
Method: Built Falcon3-Audio family using instruction-tuned LLMs and Whisper encoders. Used remarkably small public audio data (<30K hours, 5K unique). Employed single-stage training without curriculum learning, multiple encoders, or complex cross-attention connectors.
Result: Falcon3-Audio-7B matches best open-weight models on MMAU benchmark (score 64.14, matching R1-AQA) with superior data/parameter efficiency. Even the 1B model competes with larger 2B-13B open models. Extensive ablations show complex architectures unnecessary for strong performance.
Conclusion: Simple, efficient audio-language models can achieve state-of-the-art performance without complex architectures or massive datasets, challenging conventional approaches in the field.
Abstract: Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio’s centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data, less than 30K hours (5K unique), Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared to models trained on over 500K hours of data.
[284] Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition
Yujian Ma, Xikun Lu, Jinqiu Sang, Xianquan Jiang, Ruizhe Li
Main category: cs.SD
TL;DR: This paper conducts the first systematic mechanistic interpretability study of LoRA adaptation in Whisper speech models for emotion recognition, revealing delayed specialization and forward-backward dynamics.
Details
Motivation: Large pre-trained speech models like Whisper have strong generalization but are challenging to adapt efficiently. While LoRA has become popular for parameter-efficient fine-tuning, its underlying mechanisms in speech tasks remain poorly understood, creating a need for systematic interpretability analysis.
Method: The study uses a suite of analytical tools including layer contribution probing, logit-lens inspection, and representational similarity analysis via singular value decomposition (SVD) and centered kernel alignment (CKA) to examine LoRA adaptation in the Whisper encoder for speech emotion recognition.
Result: The analysis reveals two key mechanisms: 1) a delayed specialization process where early layers preserve general features before consolidating task-specific information, and 2) a forward alignment, backward differentiation dynamic between LoRA’s matrices, clarifying how LoRA reshapes encoder hierarchies.
Conclusion: The study provides empirical insights and deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models, with code made publicly available for reproducibility.
Abstract: Large pre-trained speech models such as Whisper offer strong generalization but pose significant challenges for resource-efficient adaptation. Low-Rank Adaptation (LoRA) has become a popular parameter-efficient fine-tuning method, yet its underlying mechanisms in speech tasks remain poorly understood. In this work, we conduct the first systematic mechanistic interpretability study of LoRA within the Whisper encoder for speech emotion recognition (SER). Using a suite of analytical tools, including layer contribution probing, logit-lens inspection, and representational similarity via singular value decomposition (SVD) and centered kernel alignment (CKA), we reveal two key mechanisms: a delayed specialization process that preserves general features in early layers before consolidating task-specific information, and a forward alignment, backward differentiation dynamic between LoRA’s matrices. Our findings clarify how LoRA reshapes encoder hierarchies, providing both empirical insights and a deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models. Our code is available at https://github.com/harryporry77/Behind-the-Scenes.
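Illustrative sketch: centered kernel alignment, one of the similarity tools named above, compares two sets of activations recorded on the same inputs. The snippet below implements the standard linear CKA formula, with each column mean-centered and the squared Frobenius norm of the cross-covariance normalized by the two self-terms. Whether the paper uses the linear or a kernel variant is not stated, so the linear form is an assumption, as are the toy activations.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows = examples, columns = features


def center_columns(m: Matrix) -> List[List[float]]:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_sq_norm(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between two activation matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return cross_sq_norm(xc, yc) / denom


acts = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
scaled = [[2.0 * v for v in row] for row in acts]   # isotropic rescaling
swapped = [[row[1], row[0]] for row in acts]        # feature permutation
other = [[1.0], [0.0], [0.0], [1.0]]                # a different representation

same = linear_cka(acts, acts)
scale_invariant = linear_cka(acts, scaled)
permutation_invariant = linear_cka(acts, swapped)
different = linear_cka(acts, other)
```

Linear CKA is invariant to isotropic scaling and to orthogonal transforms of the features, which is why it is convenient for comparing layers before and after adaptation.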
[285] AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs
Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang
Main category: cs.SD
TL;DR: Current Large Audio-Language Models (LALMs) have a systematic motion perception deficit: they struggle to perceive the spatial dynamics and motion of sound sources, with average accuracy below 50% on auditory motion understanding tasks.
Details
Motivation: While LALMs show impressive progress in speech recognition and audio captioning, it remains unclear whether they can perceive spatial dynamics like motion of sound sources. The paper aims to investigate this gap in auditory motion understanding.
Method: The authors introduce AudioMotionBench, the first benchmark explicitly designed to evaluate auditory motion understanding. It’s a controlled question-answering benchmark that evaluates whether LALMs can infer direction and trajectory of moving sound sources from binaural audio.
Result: Comprehensive analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns, with average accuracy remaining below 50%. This demonstrates a fundamental limitation in auditory spatial reasoning.
Conclusion: The study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool (AudioMotionBench) and new insight for enhancing spatial cognition in future Audio-Language Models.
Abstract: Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AudioMotionBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AudioMotionBench is a controlled question-answering benchmark designed to evaluate whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.
[286] Mitigation of multi-path propagation artefacts in acoustic targets with adaptive cepstral filtering
Lucas C. F. Domingos, Russell S. A. Brinkworth, Paulo E. Santos, Karl Sammut
Main category: cs.SD
TL;DR: A cepstral filtering method using adaptive band-stop filters improves acoustic target signal separation from reflections in multi-path environments, enhancing SNR and classification performance.
Details
Motivation: Passive acoustic sensing for monitoring moving targets is hindered by multi-path reflections and motion artifacts. Existing filtering techniques don't properly incorporate environmental characteristics or account for medium property variability, limiting their ability to separate source and reflection components.
Method: Proposes temporal filtering applied to cepstral coefficients using an adaptive band-stop filter that dynamically adjusts its bandwidth based on the relative intensity of quefrency components. This separates target signals from their reflections in spectrograms.
Result: Improved SNR and log-spectral distance across velocities from 10-100 m/s in aircraft noise with simulated motion. Enhanced ship-type classification performance by 2.28 and 2.62 MCC percentage points for DeepShip and VTUAD v2 datasets respectively.
Conclusion: The method demonstrates potential to improve acoustic target classification and time-delay estimation in multi-path environments. Future work focuses on amplitude preservation and multi-sensor applications.
Abstract: Passive acoustic sensing is a cost-effective solution for monitoring moving targets such as vessels and aircraft, but its performance is hindered by complex propagation effects like multi-path reflections and motion-induced artefacts. Existing filtering techniques do not properly incorporate the characteristics of the environment or account for variability in medium properties, limiting their effectiveness in separating source and reflection components. This paper proposes a method for separating target signals from their reflections in a spectrogram. Temporal filtering is applied to cepstral coefficients using an adaptive band-stop filter, which dynamically adjusts its bandwidth based on the relative intensity of the quefrency components. The method improved the signal-to-noise ratio (SNR) and log-spectral distance (LSD) across velocities ranging from 10 to 100 metres per second in aircraft noise with simulated motion. It also enhanced the performance of ship-type classification in underwater tasks by 2.28 and 2.62 Matthews Correlation Coefficient percentage points for the DeepShip and VTUAD v2 datasets, respectively. These results demonstrate the potential of the proposed pipeline to improve acoustic target classification and time-delay estimation in multi-path environments, with future work aimed at amplitude preservation and multi-sensor applications.
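Illustrative sketch: the reason cepstral processing helps with reflections is that a delayed copy of a signal shows up as a peak at the delay's quefrency in the real cepstrum. The snippet below demonstrates this on a toy signal and then zeroes a fixed-width band around the peak. The fixed-width notch is a simplification chosen for illustration; it is not the adaptive band-stop filter of the paper, and all function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, adequate for short toy signals."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(signal: Sequence[float]) -> List[float]:
    """Inverse DFT of the log magnitude spectrum."""
    spectrum = dft([complex(v) for v in signal])
    log_mag = [complex(math.log(abs(s) + 1e-12)) for s in spectrum]
    return [c.real for c in dft(log_mag, inverse=True)]


def echo_quefrency(cepstrum: Sequence[float]) -> int:
    """Quefrency bin (excluding bin 0) with the largest value in the first half."""
    return max(range(1, len(cepstrum) // 2), key=lambda q: cepstrum[q])


def notch_lifter(cepstrum: Sequence[float], center: int, half_width: int = 1) -> List[float]:
    """Zero a fixed band around `center` and its mirror bin (simplified band-stop)."""
    n = len(cepstrum)
    out = list(cepstrum)
    for q in range(center - half_width, center + half_width + 1):
        out[q % n] = 0.0
        out[(-q) % n] = 0.0  # keep the cepstrum symmetric
    return out


# Toy signal: a unit impulse plus a half-amplitude reflection 10 samples later.
n_samples, delay = 64, 10
signal = [0.0] * n_samples
signal[0] = 1.0
signal[delay] = 0.5

cepstrum = real_cepstrum(signal)
peak = echo_quefrency(cepstrum)
filtered = notch_lifter(cepstrum, peak)
```

Here the detected peak sits at the reflection delay; an adaptive variant would instead set the band's width from the relative intensity of the quefrency components, as the abstract describes.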
[287] Lightweight and perceptually-guided voice conversion for electro-laryngeal speech
Benedikt Mayrhofer, Franz Pernkopf, Philipp Aichinger, Martin Hagmüller
Main category: cs.SD
TL;DR: Lightweight adaptation of StreamVC framework improves electro-laryngeal speech by removing pitch/energy modules and using self-supervised pretraining with supervised fine-tuning on parallel EL/healthy speech data, guided by perceptual and intelligibility losses.
Details
Motivation: Electro-laryngeal speech suffers from constant pitch, limited prosody, and mechanical noise, which reduces naturalness and intelligibility, creating a need for voice rehabilitation solutions.
Method: Adapted StreamVC framework by removing pitch and energy modules, combined self-supervised pretraining with supervised fine-tuning on parallel EL and healthy speech data, using perceptual and intelligibility losses for guidance.
Result: Best model variant (+WavLM+HF) drastically reduces character error rate, raises naturalness MOS from 1.1 to 3.3, and consistently narrows the gap to healthy ground-truth speech across all evaluated metrics.
Conclusion: Demonstrates feasibility of adapting lightweight voice conversion architectures for EL voice rehabilitation, while identifying prosody generation and intelligibility improvements as remaining bottlenecks.
Abstract: Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, reducing naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC framework to this setting by removing pitch and energy modules and combining self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm their influence: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), drastically reduces character error rate (CER) of EL inputs, raises naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion architectures to EL voice rehabilitation while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.
[288] Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling
Viola Negroni, Luca Cuccovillo, Paolo Bestagini, Patrick Aichroth, Stefano Tubaro
Main category: cs.SD
TL;DR: A multi-task transformer for speech deepfake detection that predicts formant trajectories and voicing patterns while classifying speech as real/fake, with built-in explainability highlighting voiced/unvoiced region importance.
Details
Motivation: To create a more efficient and interpretable speech deepfake detection system that not only classifies audio but also provides insights into which acoustic features (voiced vs unvoiced regions) drive the decisions.
Method: Builds on prior speaker-formant transformer architecture with improvements: streamlined input segmentation strategy, redesigned decoding process, and integrated explainability mechanisms. The multi-task approach simultaneously predicts formant trajectories, voicing patterns, and binary classification.
Result: The model requires fewer parameters, trains faster, provides better interpretability, and maintains comparable prediction performance to the baseline.
Conclusion: The proposed multi-task transformer offers an efficient and explainable solution for speech deepfake detection, balancing performance with interpretability while reducing computational requirements.
Abstract: In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.
[289] WavLink: Compact Audio-Text Embeddings with a Global Whisper Token
Gokul Karthik Kumar, Ludovick Lepauloux, Hakim Hacid
Main category: cs.SD
TL;DR: WavLink is a compact audio-text embedding model that enhances Whisper encoder with a learnable global token, achieving state-of-the-art retrieval performance through systematic design optimization and two-stage training with Matryoshka-style supervision.
Details
Motivation: While Whisper is widely used as an audio encoder in large audio-language models, audio-text embedding models (like CLAP-based models) haven't effectively leveraged Whisper, instead relying on alternative encoders like HTS-AT and PaSST. There's an opportunity to create a more effective audio-text embedding model using Whisper's capabilities.
Method: WavLink augments the Whisper encoder with a learnable global token and trains it jointly with a text encoder. The authors conduct systematic studies of design choices including pretrained text encoders, loss functions, training modes, and data mixtures. They use a two-stage training recipe across three model sizes with Matryoshka-style supervision to enable scalable embeddings.
Result: The model achieves state-of-the-art retrieval performance, enables 8x smaller embeddings with minimal performance drop through Matryoshka-style supervision, and demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification tasks.
Conclusion: WavLink successfully leverages Whisper for audio-text embedding tasks, demonstrating that systematic design optimization and two-stage training with Matryoshka-style supervision can create compact, high-performing audio-text embedding models that scale effectively.
Abstract: Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.
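Matryoshka-style supervision is what makes the smaller embeddings usable: a prefix of the full vector is trained to work as an embedding on its own. The following is a minimal sketch of the inference-time side only, assuming (the abstract does not say this) that a smaller embedding is obtained by keeping the leading dimensions and re-normalizing. The function names and toy vectors are hypothetical and are not WavLink's API.

```python
import math

def truncate_embedding(vec, factor):
    """Keep the first len(vec)//factor dimensions and L2-normalize them."""
    keep = len(vec) // factor
    if keep == 0:
        raise ValueError("factor is too large for this embedding")
    prefix = vec[:keep]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 16-dimensional audio and text embeddings (made-up values).
audio = [0.9, 0.1, -0.3, 0.2] + [0.01] * 12
text = [0.8, 0.2, -0.2, 0.1] + [-0.02] * 12

small_audio = truncate_embedding(audio, 8)  # 16 -> 2 dimensions
small_text = truncate_embedding(text, 8)
full_score = cosine(audio, text)            # similarity with all dimensions
small_score = cosine(small_audio, small_text)  # similarity with the 8x smaller prefix
```

Retrieval with the truncated vectors works the same way as with the full ones; only the storage and the dot-product cost shrink.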
cs.LG
[290] Empowering LLMs for Structure-Based Drug Design via Exploration-Augmented Latent Inference
Xuanning Hu, Anchen Li, Qianli Xing, Jinglong Ji, Hao Tuo, Bo Yang
Main category: cs.LG
TL;DR: ELILLM enhances LLMs for drug design by treating generation as encoding, latent exploration, and decoding, using Bayesian optimization for systematic exploration and knowledge-guided decoding for chemical validity.
Details
Motivation: LLMs have strong capabilities but are limited in structure-based drug design due to insufficient protein structure understanding and unpredictable molecular generation.
Method: ELILLM reinterprets LLM generation as an encoding, latent space exploration, and decoding workflow. It uses Bayesian optimization for systematic exploration of latent embeddings, a position-aware surrogate model for binding affinity prediction, and knowledge-guided decoding for chemical validity.
Result: Demonstrated on CrossDocked2020 benchmark, showing strong controlled exploration and high binding affinity scores compared with seven baseline methods.
Conclusion: ELILLM effectively enhances LLMs' capabilities for structure-based drug design by addressing their limitations in protein structure understanding and molecular generation.
Abstract: Large Language Models (LLMs) possess strong representation and reasoning capabilities, but their application to structure-based drug design (SBDD) is limited by insufficient understanding of protein structures and unpredictable molecular generation. To address these challenges, we propose Exploration-Augmented Latent Inference for LLMs (ELILLM), a framework that reinterprets the LLM generation process as an encoding, latent space exploration, and decoding workflow. ELILLM explicitly explores portions of the design problem beyond the model’s current knowledge while using a decoding module to handle familiar regions, generating chemically valid and synthetically reasonable molecules. In our implementation, Bayesian optimization guides the systematic exploration of latent embeddings, and a position-aware surrogate model efficiently predicts binding affinity distributions to inform the search. Knowledge-guided decoding further reduces randomness and effectively imposes chemical validity constraints. We demonstrate ELILLM on the CrossDocked2020 benchmark, showing strong controlled exploration and high binding affinity scores compared with seven baseline methods. These results demonstrate that ELILLM can effectively enhance LLMs’ capabilities for SBDD.
[291] Language Models Entangle Language and Culture
Shourya Jain, Paras Chopra
Main category: cs.LG
TL;DR: LLMs provide lower quality answers in low-resource languages and language choice significantly impacts cultural context in responses, affecting answer quality.
Details
Motivation: Users should not be disadvantaged by their language choice when interacting with LLMs; there should be equitable response quality across languages regardless of which language is used for queries.
Method: Created real-world open-ended questions from WildChat dataset analysis to evaluate response quality variations by language. Used LLM-as-a-Judge to identify cultural context in responses. Evaluated LLMs on translated subset of CulturalBench benchmark across multiple languages.
Result: LLMs consistently provide lower quality answers to open-ended questions in low-resource languages. Language significantly impacts the cultural context used by models, and this difference in context affects downstream answer quality.
Conclusion: There is systematic disadvantage for users of low-resource languages in LLM interactions, with language choice affecting both answer quality and cultural context, highlighting the need for more equitable multilingual AI systems.
Abstract: Users should not be systemically disadvantaged by the language they use for interacting with LLMs; i.e. users across languages should get responses of similar quality irrespective of language used. In this work, we create a set of real-world open-ended questions based on our analysis of the WildChat dataset and use it to evaluate whether responses vary by language, specifically, whether answer quality depends on the language used to query the model. We also investigate how language and culture are entangled in LLMs such that choice of language changes the cultural information and context used in the response by using LLM-as-a-Judge to identify the cultural context present in responses. To further investigate this, we evaluate LLMs on a translated subset of the CulturalBench benchmark across multiple languages. Our evaluations reveal that LLMs consistently provide lower quality answers to open-ended questions in low resource languages. We find that language significantly impacts the cultural context used by the model. This difference in context impacts the quality of the downstream answer.
[292] Improving MoE Compute Efficiency by Composing Weight and Data Sparsity
Maciej Kilian, Oleg Mkrtchyan, Luke Zettlemoyer, Akshat Shrivastava, Armen Aghajanyan
Main category: cs.LG
TL;DR: The paper introduces a method to achieve data sparsity in causal MoE models by using null experts, improving compute efficiency for vision-language tasks without violating causality.
Details
Motivation: Current MoE layers achieve compute efficiency through weight sparsity (each token activates few experts), but data sparsity (each expert processes few tokens) offers complementary benefits. However, existing data sparsity methods like expert-choice routing violate causality in autoregressive models, creating train-inference mismatch.
Method: The authors recover data sparsity within causal token-choice MoE by introducing zero-compute (null) experts in the routing pool. When tokens route to null experts, those slots consume no compute. The standard load balancing objective trains the model to uniformly use all experts (real and null), creating data sparsity in expectation without causality violations.
Result: At matched expected FLOPs, combining weight and data sparsity yields better compute efficiency than weight sparsity alone, with improvements in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text tokens, without explicit modality routing.
Conclusion: Null experts enable data sparsity in causal MoE models, addressing the train-inference mismatch of expert-choice routing. This approach is particularly effective for vision-language models where data heterogeneity is pronounced, allowing implicit modality-aware routing and improved compute efficiency.
Abstract: Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating train-inference mismatch. We recover data sparsity within causal token-choice MoE by leveraging zero-compute (null) experts within the routing pool. When a token routes to null experts, those slots consume no compute. The standard load balancing objective trains the model to uniformly use all experts (real and null), thereby creating data sparsity in expectation without the causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are denser. At matched expected FLOPs, composing weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with gains in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text, without explicit modality routing.
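The null-expert idea can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the paper's implementation: the expert layout (real experts first, null experts after), the pool sizes, and the use of i.i.d. random logits as a stand-in for a perfectly load-balanced router are all hypothetical. Each token picks its own top-k experts from its own logits, so routing stays causal, and slots that land on null experts are simply not computed.

```python
import random

def route_top_k(logits, k):
    """Token-choice routing: return indices of the k largest router logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

def real_slots(logits, k, num_real):
    """Count chosen slots that hit real experts; null slots cost no compute.

    Experts 0..num_real-1 are real; the rest are null (hypothetical layout).
    """
    return sum(1 for e in route_top_k(logits, k) if e < num_real)

random.seed(0)
NUM_REAL, NUM_NULL, TOP_K, NUM_TOKENS = 4, 4, 2, 10_000

total = 0
for _ in range(NUM_TOKENS):
    # A perfectly balanced router is stood in for by i.i.d. random logits.
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_REAL + NUM_NULL)]
    total += real_slots(logits, TOP_K, NUM_REAL)

avg_real = total / NUM_TOKENS                         # measured real slots per token
expected = TOP_K * NUM_REAL / (NUM_REAL + NUM_NULL)   # balanced-usage expectation
```

Under uniform usage of all experts, the expected number of real experts per token is k times the fraction of experts that are real, which is where the "data sparsity in expectation" comes from. A trained router can deviate from this per token, for example by sending low-information tokens to null experts more often.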
[293] You Need Better Attention Priors
Elon Litman, Gabe Guo
Main category: cs.LG
TL;DR: GOAT is a new attention mechanism that replaces standard attention’s implicit uniform prior with a learnable prior based on Entropic Optimal Transport, improving length generalization and addressing attention sinks.
Details
Motivation: Standard attention has limitations: it assumes an implicit uniform prior in the transport problem, suffers from attention sinks (tokens that disproportionately attract attention), and has representational trade-offs. The paper aims to provide a more flexible attention mechanism with better length generalization.
Method: The authors view attention through Entropic Optimal Transport (EOT) and introduce Generalized Optimal transport Attention with Trainable priors (GOAT). GOAT replaces the naive uniform prior with a learnable, continuous prior that maintains compatibility with optimized kernels like FlashAttention. It also absorbs spatial information into core attention computation.
Result: GOAT provides an EOT-based explanation of attention sinks and offers a solution that avoids representational trade-offs. It learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
Conclusion: GOAT generalizes the attention mechanism by framing it as an Entropic Optimal Transport problem with learnable priors, addressing key limitations of standard attention while maintaining computational efficiency and improving length generalization capabilities.
Abstract: We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport, revealing that standard attention corresponds to a transport problem regularized by an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this naive assumption with a learnable, continuous prior. This prior maintains full compatibility with optimized kernels such as FlashAttention. GOAT also provides an EOT-based explanation of attention sinks and materializes a solution for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
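The role of a prior in a single attention row can be shown with a small sketch. It relies on a standard fact: maximizing the expected score minus the KL divergence to a prior π over the probability simplex gives weights proportional to π_j · exp(s_j), so a uniform π recovers the usual softmax. How GOAT parameterizes and learns its prior is not described in the abstract; the `recency` prior below is a hypothetical placeholder.

```python
import math

def prior_softmax(scores, prior):
    """Return p with p_j proportional to prior[j] * exp(scores[j])."""
    if len(scores) != len(prior):
        raise ValueError("scores and prior must have the same length")
    if any(p <= 0.0 for p in prior):
        raise ValueError("prior entries must be strictly positive")
    # Adding log(prior) to the logits is equivalent to multiplying by the prior.
    logits = [s + math.log(p) for s, p in zip(scores, prior)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

scores = [2.0, 1.0, 0.5, 0.5]    # one query's scaled dot products (toy values)
uniform = [0.25] * 4             # the implicit prior of standard attention
recency = [0.1, 0.2, 0.3, 0.4]   # hypothetical prior favoring later keys

standard = prior_softmax(scores, uniform)  # identical to plain softmax(scores)
biased = prior_softmax(scores, recency)    # mass shifts toward later keys
```

Because the prior enters only as an additive bias on the logits, this form is compatible in principle with kernels that accept an additive attention bias.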
[294] FedUMM: A General Framework for Federated Learning with Unified Multimodal Models
Zhaolong Su, Leheng Zhao, Xiaoying Wu, Ziyue Xu, Jindong Wang
Main category: cs.LG
TL;DR: FedUMM is a federated learning framework for unified multimodal models that enables privacy-preserving training across distributed clients with non-IID multimodal data while maintaining low communication costs through adapter-only fine-tuning.
Details
Motivation: Current unified multimodal models (UMMs) are trained in centralized settings, which limits deployment in privacy-sensitive and geographically distributed scenarios due to data privacy concerns and the need to gather all data in a central server.
Method: FedUMM uses federated learning with parameter-efficient fine-tuning: clients train lightweight LoRA adapters while freezing the foundation model (BLIP3o backbone), and the server aggregates only adapter updates. Built on NVIDIA FLARE, it handles non-IID multimodal data with Dirichlet-controlled heterogeneity across up to 16 clients.
Result: Results show slight performance degradation as client count and heterogeneity increase, but FedUMM remains competitive with centralized training. Adapter-only federation reduces per-round communication by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training.
Conclusion: FedUMM provides an effective framework for privacy-preserving federated training of unified multimodal models, demonstrating practical feasibility through adapter-only fine-tuning that significantly reduces communication overhead while maintaining competitive performance.
Abstract: Unified multimodal models (UMMs) are emerging as strong foundation models that can do both generation and understanding tasks in a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered in a central server, limiting the deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low communication cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while freezing the foundation models, and the server aggregates only adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmarks under Dirichlet-controlled heterogeneity with up to 16 clients. Results show slight degradation as client count and heterogeneity increase, while remaining competitive with centralized training. We further analyze computation–communication trade-offs and demonstrate that adapter-only federation reduces per-round communication by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training. This work provides empirical experience for future research on privacy-preserving federated unified multimodal models.
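Server-side aggregation of adapter-only updates can be sketched as a weighted average. The abstract does not state which aggregation rule FedUMM uses, so this assumes a FedAvg-style mean weighted by local dataset size; the parameter names, values, and client sizes are hypothetical, and nothing here uses the NVIDIA FLARE API.

```python
def aggregate_adapters(client_updates, client_sizes):
    """Weighted average of per-client adapter parameters.

    client_updates: list of dicts mapping parameter name -> list of floats.
    client_sizes:   number of local training examples per client.
    """
    if not client_updates or len(client_updates) != len(client_sizes):
        raise ValueError("need one size per client update")
    total = float(sum(client_sizes))
    merged = {}
    for name in client_updates[0]:
        length = len(client_updates[0][name])
        merged[name] = [
            sum(u[name][i] * n for u, n in zip(client_updates, client_sizes)) / total
            for i in range(length)
        ]
    return merged

# Two hypothetical clients, each holding a tiny LoRA pair (A, B) for one layer.
updates = [
    {"layer0.lora_A": [1.0, 2.0], "layer0.lora_B": [0.0, 4.0]},
    {"layer0.lora_A": [3.0, 6.0], "layer0.lora_B": [8.0, 0.0]},
]
sizes = [100, 300]
global_adapter = aggregate_adapters(updates, sizes)
```

Only the adapter tensors cross the network, which is why per-round communication is so much smaller than with full fine-tuning. Note that averaging the A and B factors separately is not the same as averaging their product, a known caveat of naive LoRA aggregation.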
[295] Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC
Ashna Nawar Ahmed, Banooqa Banday, Terry Jones, Tanzima Z. Islam
Main category: cs.LG
TL;DR: A surrogate-assisted multi-objective Bayesian optimization framework using attention-based job embeddings to optimize HPC scheduling decisions for runtime-power trade-offs.
Details
Motivation: HPC schedulers need to balance user performance with resource constraints by selecting optimal node counts for jobs, which is a complex decision requiring automation.
Method: Uses surrogate-assisted multi-objective Bayesian optimization with attention-based embeddings of job telemetry to capture performance dynamics, paired with intelligent sample acquisition for data efficiency.
Result: On two production HPC datasets, the embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines, with intelligent sampling reducing training costs and improving result stability.
Conclusion: This is the first successful application of embedding-informed surrogates in a MOBO framework to HPC scheduling, jointly optimizing for performance and power on production workloads.
Abstract: High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task boils down to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads.
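A Pareto front of runtime-power trade-offs is the set of configurations that no other configuration beats on both objectives at once. The sketch below shows only that final filtering step on surrogate predictions; it does not implement the MOBO loop or the attention-based embeddings, and the candidate values are made up.

```python
def pareto_front(points):
    """Return the non-dominated (runtime, power) pairs, both to be minimized."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical (runtime in s, power in kW) predictions for node counts 1..5.
candidates = [(100.0, 2.0), (60.0, 3.5), (45.0, 5.0), (44.0, 7.5), (50.0, 8.0)]
front = pareto_front(candidates)
```

Here the last candidate is dropped because another configuration is both faster and less power-hungry; the remaining four are genuine trade-offs a scheduler would choose among.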
[296] Ambient Dataloops: Generative Models for Dataset Refinement
Adrián Rodríguez-Muñoz, William Daspit, Adam Klivans, Antonio Torralba, Constantinos Daskalakis, Giannis Daras
Main category: cs.LG
TL;DR: Ambient Dataloops is an iterative dataset refinement framework that improves diffusion model training by progressively enhancing dataset quality through a dataset-model co-evolution process with noise-level reduction.
Details
Motivation: Modern datasets contain samples of varying quality, and training diffusion models directly on heterogeneous data yields suboptimal results. There's a need for a systematic approach to improve dataset quality during training.
Method: An iterative dataset-model co-evolution framework where: 1) datasets become progressively higher quality each iteration, 2) synthetically improved samples are treated as noisy but at slightly lower noise levels than previous iterations, 3) Ambient Diffusion techniques are used for learning under corruption to avoid destructive self-consuming loops.
Result: Achieves state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. Provides theoretical justification for the data looping procedure benefits.
Conclusion: Ambient Dataloops effectively refines datasets through iterative co-evolution with models, enabling better learning of underlying data distributions and superior performance across multiple domains while avoiding destructive training loops.
Abstract: We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset-model co-evolution process; at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation, we treat the synthetically improved samples as noisy, but at a slightly lower noise level than the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieves state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. We further provide a theoretical justification for the proposed framework that captures the benefits of the data looping procedure.
[297] Lattice: A Confidence-Gated Hybrid System for Uncertainty-Aware Sequential Prediction with Behavioral Archetypes
Lorian Bannis
Main category: cs.LG
TL;DR: Lattice is a hybrid sequential prediction system using binary confidence gating to conditionally activate learned behavioral archetypes, improving performance when patterns apply while preventing false activation during distribution shifts.
Details
Motivation: To create a robust sequential prediction system that can manage epistemic uncertainty in safety-critical applications by conditionally activating learned behavioral structure only when confident, preventing false pattern application during distribution shifts.
Method: Clusters behavior windows into behavioral archetypes and uses binary confidence gating to activate archetype-based scoring only when confidence exceeds a threshold, falling back to baseline predictions when uncertain. Validated on recommendation systems (MovieLens), scientific time-series (LIGO), and financial markets using LSTM and transformer backbones.
Result: On MovieLens with LSTM: +31.9% improvement over LSTM baseline in HR@10, outperforming transformer baselines by 109.4% over SASRec and 218.6% over BERT4Rec. On LIGO/financial data: correctly refuses archetype activation during distribution shift. On transformer backbones: 0.0% improvement (neutral, no degradation).
Conclusion: Confidence gating is a promising architectural principle for managing epistemic uncertainty in safety-critical applications, enabling systems to activate when patterns apply, refuse when they don’t, and defer when redundant.
Abstract: We introduce Lattice, a hybrid sequential prediction system that conditionally activates learned behavioral structure using binary confidence gating. The system clusters behavior windows into behavioral archetypes and uses binary confidence gating to activate archetype-based scoring only when confidence exceeds a threshold, falling back to baseline predictions when uncertain. We validate Lattice on recommendation systems (MovieLens), scientific time-series (LIGO), and financial markets, using LSTM and transformer backbones. On MovieLens with LSTM, Lattice achieves +31.9% improvement over LSTM baseline in HR@10 (p < 3.29 x 10^-25, 30 seeds), outperforming transformer baselines by 109.4% over SASRec and 218.6% over BERT4Rec. On LIGO and financial data, the system correctly refuses archetype activation when distribution shift occurs - a successful outcome demonstrating confidence gating prevents false activation. On transformer backbones, Lattice provides 0.0% improvement (neutral, no degradation), gracefully deferring when structure is already present. This bidirectional validation - activating when patterns apply, refusing when they don’t, and deferring when redundant - supports confidence gating as a promising architectural principle for managing epistemic uncertainty in safety-critical applications.
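Binary confidence gating reduces to a single branch: use the archetype-based scores only when confidence exceeds the threshold, otherwise return the backbone's scores untouched. This is a minimal sketch of that control flow; the threshold value, the additive blend, the blend weight, and how confidence is computed are all assumptions not given in the abstract.

```python
def gated_scores(baseline, archetype, confidence, threshold=0.8, weight=0.5):
    """Blend in archetype scores only when confidence exceeds the threshold."""
    if len(baseline) != len(archetype):
        raise ValueError("score lists must have the same length")
    if confidence <= threshold:
        return list(baseline)  # gate closed: fall back to the backbone alone
    # Gate open: add a weighted archetype score to each item (assumed blend).
    return [b + weight * a for b, a in zip(baseline, archetype)]

def top_item(scores):
    """Index of the highest-scoring item."""
    return max(range(len(scores)), key=lambda i: scores[i])

baseline = [0.2, 0.5, 0.4]    # backbone scores for three candidate items
archetype = [0.0, 0.0, 0.6]   # archetype-based scores (toy values)

confident = gated_scores(baseline, archetype, confidence=0.93)
uncertain = gated_scores(baseline, archetype, confidence=0.41)
```

With the gate open, the archetype signal changes the top recommendation; with the gate closed, the output is exactly the baseline, which is the "refuse" behavior reported under distribution shift.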
[298] CASL: Concept-Aligned Sparse Latents for Interpreting Diffusion Models
Zhenghao He, Guangzhi Xiong, Boyang Wang, Sanchit Sinha, Aidong Zhang
Main category: cs.LG
TL;DR: CASL is a supervised framework that aligns sparse latent dimensions in diffusion models with semantic concepts, enabling interpretable semantic control over image generation.
Details
Motivation: Existing SAE-based methods for understanding diffusion models are unsupervised and fail to align sparse features with human-understandable concepts, limiting reliable semantic control over generated images.
Method: Trains a Sparse Autoencoder on frozen U-Net activations to get disentangled latents, then learns a lightweight linear mapping that associates each concept with relevant latent dimensions. Uses CASL-Steer for controlled latent intervention as a causal probe, and introduces Editing Precision Ratio (EPR) metric.
Result: Achieves superior editing precision and interpretability compared to existing approaches. First work to achieve supervised alignment between latent representations and semantic concepts in diffusion models.
Conclusion: CASL enables interpretable semantic control over diffusion models by aligning sparse latent dimensions with human-understandable concepts through supervised learning, providing better understanding and control over generated content.
Abstract: Internal activations of diffusion models encode rich semantic information, but interpreting such representations remains challenging. While Sparse Autoencoders (SAEs) have shown promise in disentangling latent representations, existing SAE-based methods for diffusion model understanding rely on unsupervised approaches that fail to align sparse features with human-understandable concepts. This limits their ability to provide reliable semantic control over generated images. We introduce CASL (Concept-Aligned Sparse Latents), a supervised framework that aligns sparse latent dimensions of diffusion models with semantic concepts. CASL first trains an SAE on frozen U-Net activations to obtain disentangled latent representations, and then learns a lightweight linear mapping that associates each concept with a small set of relevant latent dimensions. To validate the semantic meaning of these aligned directions, we propose CASL-Steer, a controlled latent intervention that shifts activations along the learned concept axis. Unlike editing methods, CASL-Steer is used solely as a causal probe to reveal how concept-aligned latents influence generated content. We further introduce the Editing Precision Ratio (EPR), a metric that jointly measures concept specificity and the preservation of unrelated attributes. Experiments show that our method achieves superior editing precision and interpretability compared to existing approaches. To the best of our knowledge, this is the first work to achieve supervised alignment between latent representations and semantic concepts in diffusion models.
[299] Learning from Synthetic Data: Limitations of ERM
Kareem Amin, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii
Main category: cs.LG
TL;DR: The paper analyzes learning theory problems when training data contains both natural and LLM-generated synthetic content, showing ERM has limitations but specialized algorithms can overcome contamination.
Details
Motivation: With the rise of cheap LLMs, synthetic content now contaminates natural data across many domains (reviews, legal documents, etc.). This creates fundamental learning theory challenges where algorithms must learn from mixed natural/synthetic data without knowing which examples are synthetic.
Method: Models the problem as sequence learning tasks with mixed natural/synthetic data where algorithms are oblivious to data origin. Analyzes ERM performance for mean estimation and PAC learning, comparing it to algorithms using non-uniform weighting across data generations.
Result: For mean estimation, ERM converges to the true mean but is outperformed by algorithms with non-uniform weighting. For PAC learning, ERM sometimes fails to converge to the true concept (echoing model collapse), but specialized algorithms can learn the correct hypothesis for arbitrary VC classes despite contamination.
Conclusion: Standard ERM has limitations in synthetic data contamination settings, but carefully designed algorithms can overcome these challenges and learn effectively from mixed natural/synthetic data.
Abstract: The prevalence and low cost of LLMs have led to a rise of synthetic content. From review sites to court documents, “natural” content has been contaminated by data points that appear similar to natural data, but are in fact LLM-generated. In this work we revisit fundamental learning theory questions in this, now ubiquitous, setting. We model this scenario as a sequence of learning tasks where the input is a mix of natural and synthetic data, and the learning algorithms are oblivious to the origin of any individual example. We study the possibilities and limitations of ERM in this setting. For the problem of estimating the mean of an arbitrary $d$-dimensional distribution, we find that while ERM converges to the true mean, it is outperformed by an algorithm that assigns non-uniform weights to examples from different generations of data. For the PAC learning setting, the disparity is even more stark. We find that ERM does not always converge to the true concept, echoing the model collapse literature. However, we show there are algorithms capable of learning the correct hypothesis for arbitrary VC classes and arbitrary amounts of contamination.
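For mean estimation under squared loss, ERM is the plain sample mean, which weights every example equally regardless of when it was collected. The alternative described in the abstract assigns one weight per data generation. The sketch below shows only the shape of such an estimator on scalar data: the weights are arbitrary placeholders, since the abstract does not say which weights are optimal, and the data values are made up.

```python
def weighted_mean(generations, weights):
    """Mean of all examples, with one weight per data generation."""
    if len(generations) != len(weights):
        raise ValueError("need exactly one weight per generation")
    num = 0.0
    den = 0.0
    for batch, w in zip(generations, weights):
        num += w * sum(batch)   # weighted sum of this generation's examples
        den += w * len(batch)   # weighted count of this generation's examples
    return num / den

# generations[t] holds the scalar examples collected in round t (toy values).
generations = [[1.0, 3.0], [2.0, 2.0, 2.0], [10.0]]

erm = weighted_mean(generations, [1.0, 1.0, 1.0])         # plain sample mean
reweighted = weighted_mean(generations, [3.0, 1.0, 0.5])  # hypothetical weights
```

Down-weighting later generations, as in the second call, is one intuitive choice when later rounds are more likely to contain synthetic examples, but that intuition is an assumption here and not a result from the paper.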
[300] Panther: Faster and Cheaper Computations with Randomized Numerical Linear Algebra
Fahd Seddik, Abdulrahman Elbedewy, Gaser Sami, Mohamed Abdelmoniem, Yahia Zakaria
Main category: cs.LG
TL;DR: Panther is a PyTorch-compatible library that implements Randomized Numerical Linear Algebra (RandNLA) algorithms to compress deep learning models, achieving up to 75% memory savings on BERT with minimal code changes.
Details
Motivation: Training modern deep learning models is constrained by GPU memory and compute limits. RandNLA offers proven compression techniques, but the lack of a unified, production-grade library has prevented widespread adoption.
Method: Panther consolidates established RandNLA algorithms into a single high-performance framework with drop-in replacements for standard components (sketched linear layers, 2D convolution, multi-head attention, randomized matrix decompositions) and implements a custom C++/CUDA backend (pawX) optimized for both CPUs and GPUs.
Result: By replacing standard PyTorch linear layers with Panther layers (requiring only a few lines of code), the library achieves significant memory savings (up to 75%) on BERT while maintaining comparable loss.
Conclusion: Panther demonstrates the effectiveness of RandNLA techniques and provides an easy-to-adopt solution for memory-efficient deep learning, with source code available under MIT License.
Abstract: Training modern deep learning models is increasingly constrained by GPU memory and compute limits. While Randomized Numerical Linear Algebra (RandNLA) offers proven techniques to compress these models, the lack of a unified, production-grade library prevents wide adoption of these methods. We present Panther, a PyTorch-compatible library that consolidates established RandNLA algorithms into a single high-performance framework. Panther engineers efficient, drop-in replacements for standard components including sketched linear layers, 2D convolution, multi-head attention, and randomized matrix decompositions (such as pivoted CholeskyQR). By implementing a custom C++/CUDA backend (pawX), Panther provides an optimized implementation that can run on both CPUs and GPUs. We demonstrate the effectiveness of RandNLA techniques and Panther’s ease of adoption. By replacing standard PyTorch linear layers with Panther layers (requiring only a few lines of code) we achieve significant memory savings (up to 75%) on BERT while maintaining comparable loss. Source code is available (MIT License) at https://github.com/FahdSeddik/panther, along with a demonstration video at https://youtu.be/7M3RQb4KWxs.
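A rough sense of where memory savings from compressed linear layers can come from is given by simple parameter counting. This sketch assumes the weight matrix is stored as two low-rank factors, which may differ from how Panther's sketched layers are actually parameterized; it does not use Panther's API. The layer size and rank are illustrative choices and are not the configuration behind the paper's BERT result.

```python
def dense_params(d_in, d_out):
    """Weights in a standard linear layer, ignoring the bias."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Weights when the matrix is stored as two factors of the given rank."""
    return rank * (d_in + d_out)

def savings(d_in, d_out, rank):
    """Fraction of weight memory saved by the factored form."""
    return 1.0 - low_rank_params(d_in, d_out, rank) / dense_params(d_in, d_out)

# Illustrative sizes only: a 768x768 projection with a hypothetical rank of 96.
saved = savings(768, 768, 96)
```

The factored form is smaller whenever the rank is below d_in * d_out / (d_in + d_out), which is 384 for a 768x768 layer.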
[301] Multi-Targeted Graph Backdoor Attack
Md Nabi Newaz Khan, Abdullah Arafat Miah, Yu Bi
Main category: cs.LG
TL;DR: First multi-targeted backdoor attack for graph classification using subgraph injection instead of replacement, achieving high attack success rates across multiple targets with minimal clean accuracy impact.
Details
Motivation: Existing graph classification backdoor attacks are limited to single-target attacks using subgraph replacement. There's a need to explore multi-targeted attacks that can simultaneously redirect predictions to different target labels while preserving original graph structures.
Method: Proposes subgraph injection (instead of replacement) to poison clean graphs while preserving original structure. Multiple triggers are simultaneously implanted to redirect predictions to different target labels. Investigates various attack design parameters including injection methods, connection numbers, trigger sizes, edge density, and poisoning ratios.
Result: Achieves high attack success rates for all target labels with minimal impact on clean accuracy. Outperforms conventional subgraph replacement-based attacks across five datasets. Generalizes effectively across four different GNN architectures regardless of training parameters. Demonstrates robustness against state-of-the-art defenses (randomized smoothing and fine-pruning).
Conclusion: This work highlights GNN vulnerability to multi-targeted backdoor attacks in graph classification. The proposed subgraph injection approach enables effective multi-target poisoning while preserving graph structure, posing significant security concerns for GNN applications.
Abstract: Graph neural networks (GNNs) have demonstrated exceptional performance in solving critical problems across diverse domains yet remain susceptible to backdoor attacks. Existing studies on backdoor attacks for graph classification are limited to single-target attacks using a subgraph-replacement mechanism, where the attacker implants only one trigger into the GNN model. In this paper, we introduce the first multi-targeted backdoor attack for the graph classification task, where multiple triggers simultaneously redirect predictions to different target labels. Instead of subgraph replacement, we propose subgraph injection, which preserves the structure of the original graphs while poisoning the clean graphs. Extensive experiments demonstrate the efficacy of our approach: our attack achieves high attack success rates for all target labels with minimal impact on clean accuracy. Experimental results on five datasets demonstrate the superior performance of our attack framework compared to the conventional subgraph replacement-based attack. Our analysis on four GNN models confirms the generalization capability of our attack, which is effective regardless of the GNN model architecture and training parameter settings. We further investigate the impact of the attack design parameters, including injection methods, number of connections, trigger sizes, trigger edge density, and poisoning ratios. Additionally, our evaluation against state-of-the-art defenses (randomized smoothing and fine-pruning) demonstrates the robustness of our proposed multi-target attacks. This work highlights the vulnerability of GNNs to multi-targeted backdoor attacks in the graph classification task. Our source code will be available at https://github.com/SiSL-URI/Multi-Targeted-Graph-Backdoor-Attack.
[302] Early predicting of hospital admission using machine learning algorithms: Priority queues approach
Jakub Antczak, James Montgomery, Małgorzata O’Reilly, Zbigniew Palmowski, Richard Turner
Main category: cs.LG
TL;DR: This study compares SARIMAX, XGBoost, and LSTM models for forecasting emergency department arrivals, showing XGBoost performs best for total admissions while SARIMAX excels for complex cases, though all models struggle with sudden demand surges.
Details
Motivation: Emergency Department overcrowding compromises patient safety and operational efficiency, requiring accurate demand forecasting for effective resource allocation.
Method: The study evaluates SARIMAX, XGBoost, and LSTM models for 7-day ED arrival forecasting using Australian hospital data (2017-2021). It decomposes demand into 8 ward categories and stratifies by clinical complexity. Prophet model generates synthetic counterfactual values to address COVID-19 data distortions.
Result: All three models outperform seasonal naive baseline. XGBoost achieved highest accuracy for total daily admissions (MAE=6.63), while SARIMAX performed best for major complexity cases (MAE=3.77). All models successfully reproduce regular patterns but underestimate sudden demand surges.
Conclusion: The techniques effectively forecast regular ED patterns, with XGBoost and SARIMAX showing specific strengths for different forecasting tasks, but all share limitations in predicting infrequent demand surges.
Abstract: Emergency Department overcrowding is a critical issue that compromises patient safety and operational efficiency, necessitating accurate demand forecasting for effective resource allocation. This study evaluates and compares three distinct predictive models: Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX), EXtreme Gradient Boosting (XGBoost) and Long Short-Term Memory (LSTM) networks for forecasting daily ED arrivals over a seven-day horizon. Utilizing data from an Australian tertiary referral hospital spanning January 2017 to December 2021, this research distinguishes itself by decomposing demand into eight specific ward categories and stratifying patients by clinical complexity. To address data distortions caused by the COVID-19 pandemic, the study employs the Prophet model to generate synthetic counterfactual values for the anomalous period. Experimental results demonstrate that all three proposed models consistently outperform a seasonal naive baseline. XGBoost demonstrated the highest accuracy for predicting total daily admissions with a Mean Absolute Error of 6.63, while the statistical SARIMAX model proved marginally superior for forecasting major complexity cases with an MAE of 3.77. The study concludes that while these techniques successfully reproduce regular day-to-day patterns, they share a common limitation in underestimating sudden, infrequent surges in patient volume.
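To make the seasonal naive baseline and the MAE figures above concrete, here is a minimal sketch. It assumes a weekly season (7 days) and uses invented daily arrival counts; none of the numbers come from the study, and the function names are hypothetical.

```python
def seasonal_naive_forecast(history, horizon=7, season=7):
    """Repeat the last full season of observations across the forecast horizon."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    last_season = history[-season:]
    return [last_season[h % season] for h in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented daily ED arrival counts for two weeks (illustration only).
history = [120, 110, 105, 108, 115, 130, 140,
           122, 112, 104, 109, 117, 128, 143]
# Invented arrivals for the following week, used to score the forecast.
actual_next_week = [125, 111, 106, 107, 118, 131, 150]

forecast = seasonal_naive_forecast(history)
baseline_mae = mean_absolute_error(actual_next_week, forecast)
print(forecast, round(baseline_mae, 3))
```

Any candidate model (SARIMAX, XGBoost, LSTM) would be scored with the same MAE function against the same held-out week, so that "outperforms the baseline" means a lower value than `baseline_mae`.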
[303] Martingale Foresight Sampling: A Principled Approach to Inference-Time LLM Decoding
Huayu Li, ZhengXiao He, Siyuan Tian, Jinghao Wen, Ao Li
Main category: cs.LG
TL;DR: MFS is a new decoding method for LLMs that uses Martingale theory to improve reasoning by modeling reasoning paths as stochastic processes, replacing heuristic search with principled probability theory.
Details
Motivation: Standard autoregressive decoding in LLMs is short-sighted and fails to find globally optimal reasoning paths. Existing inference-time strategies use ad-hoc heuristics for path valuation and pruning, lacking theoretical grounding.
Method: Reformulates LLM decoding as identifying an optimal stochastic process. Uses Martingale theory: Doob Decomposition Theorem for step valuation, Optional Stopping Theory for path selection, and Martingale Convergence Theorem for adaptive stopping.
Result: Experiments on six reasoning benchmarks show MFS surpasses state-of-the-art methods in accuracy while significantly improving computational efficiency.
Conclusion: MFS provides a principled, theoretically-grounded alternative to heuristic decoding methods, demonstrating both improved accuracy and efficiency in reasoning tasks.
Abstract: Standard autoregressive decoding in large language models (LLMs) is inherently short-sighted, often failing to find globally optimal reasoning paths due to its token-by-token generation process. While inference-time strategies like foresight sampling attempt to mitigate this by simulating future steps, they typically rely on ad-hoc heuristics for valuing paths and pruning the search space. This paper introduces Martingale Foresight Sampling (MFS), a principled framework that reformulates LLM decoding as a problem of identifying an optimal stochastic process. By modeling the quality of a reasoning path as a stochastic process, we leverage Martingale theory to design a theoretically-grounded algorithm. Our approach replaces heuristic mechanisms with principles from probability theory: step valuation is derived from the Doob Decomposition Theorem to measure a path’s predictable advantage, path selection uses Optional Stopping Theory for principled pruning of suboptimal candidates, and an adaptive stopping rule based on the Martingale Convergence Theorem terminates exploration once a path’s quality has provably converged. Experiments on six reasoning benchmarks demonstrate that MFS surpasses state-of-the-art methods in accuracy while significantly improving computational efficiency. Code will be released at https://github.com/miraclehetech/EACL2026-Martingale-Foresight-Sampling.
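For readers unfamiliar with the Doob Decomposition Theorem named above, its standard textbook statement is sketched below. The symbols are assumptions for illustration: $X_n$ stands for an integrable process adapted to a filtration $(\mathcal{F}_n)$, which one might read as the quality score of a reasoning path after $n$ steps. How MFS defines that score, and how it estimates the conditional expectations, is not specified in this summary.

```latex
% Doob decomposition of an adapted, integrable process (X_n):
X_n = X_0 + M_n + A_n, \qquad n \ge 0,

% predictable part (the drift that is known one step ahead):
A_n = \sum_{k=1}^{n} \mathbb{E}\!\left[ X_k - X_{k-1} \,\middle|\, \mathcal{F}_{k-1} \right],

% martingale part (the unpredictable fluctuation):
M_n = \sum_{k=1}^{n} \left( X_k - \mathbb{E}\!\left[ X_k \,\middle|\, \mathcal{F}_{k-1} \right] \right).
```

Under this reading, the increment $A_n - A_{n-1}$ is a natural candidate for the "predictable advantage" of a step, although that mapping is an interpretation rather than a detail confirmed here.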
[304] MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification
Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai
Main category: cs.LG
TL;DR: Margin-Aware Speculative Verification improves speculative decoding by adapting verification to the target model’s decisiveness, relaxing strict token rejection when it provides minimal benefit, leading to faster inference without quality loss.
Details
Motivation: Current speculative decoding verification relies on strict token-level rejection sampling, which is inefficient when target models have weak preferences among top candidates. Rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback costs.
Method: Proposes Margin-Aware Speculative Verification that conditions verification on decision stability measured from target logits. It relaxes rejection only when strict verification provides minimal benefit, adapting to the target model’s local decisiveness without training.
Result: Extensive experiments across model scales (8B to 235B) show consistent and significant inference speedups over state-of-the-art baselines while preserving generation quality across diverse benchmarks.
Conclusion: The proposed verification strategy is training-free, domain-agnostic, fully compatible with existing target-coupled speculative decoding frameworks, and addresses fundamental inefficiencies in current verification mechanisms.
Abstract: Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the verification mechanism itself remains largely unchanged, relying on strict token-level rejection sampling. In practice, modern LLMs frequently operate in low-margin regimes where the target model exhibits weak preference among top candidates. In such cases, rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback cost, leading to a fundamental inefficiency in verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model’s local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit. Importantly, the approach modifies only the verification rule and is fully compatible with existing target-coupled speculative decoding frameworks. Extensive experiments across model scales ranging from 8B to 235B demonstrate that our method delivers consistent and significant inference speedups over state-of-the-art baselines while preserving generation quality across diverse benchmarks.
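The idea of conditioning verification on the target model's margin can be illustrated with a small sketch. Everything here is an assumption made for illustration: the acceptance rule, the greedy (argmax-match) notion of strict verification, the threshold value, and the function names are hypothetical and are not taken from the paper, which uses rejection sampling as its strict baseline.

```python
def top_two(logits):
    """Return the indices of the largest and second-largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[0], order[1]


def verify_token(target_logits, draft_token, margin_threshold=0.5):
    """Accept or reject one drafted token under a margin-relaxed greedy rule.

    Assumed rule (not the paper's exact criterion):
      * the target's top-1 token is always accepted;
      * the runner-up is accepted only when the top-1/top-2 logit gap is
        below margin_threshold, i.e. the target is locally indecisive;
      * every other token is rejected.
    """
    best, runner_up = top_two(target_logits)
    if draft_token == best:
        return True
    margin = target_logits[best] - target_logits[runner_up]
    return draft_token == runner_up and margin < margin_threshold


# Low-margin step: the runner-up (token 1) is accepted.
print(verify_token([2.0, 1.9, -1.0], draft_token=1))
# High-margin step: the same runner-up is rejected.
print(verify_token([5.0, 1.9, -1.0], draft_token=1))
```

The design point the sketch tries to capture is that only the verification rule changes; the drafter and the target model are untouched, which is what makes such a rule training-free.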
[305] Data-driven Lake Water Quality Forecasting for Time Series with Missing Data using Machine Learning
Rishit Chatterjee, Tahiya Chowdhury
Main category: cs.LG
TL;DR: Ridge regression best for forecasting Secchi Disk Depth in lakes; requires only ~64 recent samples and 1 predictor to stay within 5% of full-history accuracy, enabling efficient volunteer monitoring.
Details
Motivation: Volunteer-led lake monitoring produces irregular, seasonal time series with many gaps due to ice cover, weather constraints, and human errors, making harmful algal bloom forecasting and early warning difficult.
Method: Used 30-lake dataset from three decades of Maine lake records; handled missing data with Multiple Imputation by Chained Equations (MICE); evaluated six models using normalized Mean Absolute Error (nMAE); ridge regression performed best; developed joint feasibility function to determine minimal training history and predictors.
Result: Ridge regression provided best mean test performance; model reaches within 5% of full-history accuracy with ~176 training samples per lake; compact four-feature subset matches thirteen-feature baseline; joint feasibility analysis shows only ~64 recent samples and one predictor needed per lake to stay within 5% accuracy target.
Conclusion: Joint feasibility strategy unifies recent-history length and feature choice under fixed accuracy target, providing simple, efficient rule for setting sampling effort and measurement priorities for lake researchers, making targeted monitoring practical.
Abstract: Volunteer-led lake monitoring yields irregular, seasonal time series with many gaps arising from ice cover, weather-related access constraints, and occasional human errors, complicating forecasting and early warning of harmful algal blooms. We study Secchi Disk Depth (SDD) forecasting on a 30-lake, data-rich subset drawn from three decades of in situ records collected across Maine lakes. Missingness is handled via Multiple Imputation by Chained Equations (MICE), and we evaluate performance with a normalized Mean Absolute Error (nMAE) metric for cross-lake comparability. Among six candidates, ridge regression provides the best mean test performance. Using ridge regression, we then quantify the minimal sample size, showing that under a backward, recent-history protocol, the model reaches within 5% of full-history accuracy with approximately 176 training samples per lake on average. We also identify a minimal feature set, where a compact four-feature subset matches the thirteen-feature baseline within the same 5% tolerance. Bringing these results together, we introduce a joint feasibility function that identifies the minimal training history and fewest predictors sufficient to achieve the target of staying within 5% of the complete-history, full-feature baseline. In our study, meeting the 5% accuracy target required about 64 recent samples and just one predictor per lake, highlighting the practicality of targeted monitoring. Hence, our joint feasibility strategy unifies recent-history length and feature choice under a fixed accuracy target, yielding a simple, efficient rule for setting sampling effort and measurement priorities for lake researchers.
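The joint feasibility idea described above can be sketched as a search over a grid of (training samples, predictors) settings. This is a rough illustration under stated assumptions: nMAE is normalized here by the mean absolute observed value, "within 5%" is read as an error at most 5% above the baseline, ties are broken by fewest predictors and then fewest samples, and the error grid is invented. None of these choices is confirmed by the abstract.

```python
def nmae(actual, predicted):
    """MAE divided by the mean absolute observed value (assumed normalization)."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    scale = sum(abs(a) for a in actual) / len(actual)
    return mae / scale


def minimal_feasible(errors, baseline_error, tolerance=0.05):
    """Pick the cheapest (n_samples, n_features) setting within tolerance.

    errors maps (n_samples, n_features) -> nMAE. A setting is feasible when
    its error is at most (1 + tolerance) * baseline_error. Among feasible
    settings, prefer fewest features, then fewest samples (an assumption).
    """
    limit = (1.0 + tolerance) * baseline_error
    feasible = [(k, n) for (n, k), err in errors.items() if err <= limit]
    if not feasible:
        return None
    n_features, n_samples = min(feasible)
    return n_samples, n_features


# Invented error grid for one lake: (recent samples, predictors) -> nMAE.
errors = {
    (32, 1): 0.300,
    (64, 1): 0.209,
    (64, 4): 0.205,
    (176, 1): 0.207,
    (176, 4): 0.202,
    (400, 13): 0.200,  # full history, full feature set (baseline)
}
print(minimal_feasible(errors, baseline_error=0.200))
```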
[306] SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model
Xianghao Zhan, Jingyu Xu, Yuanning Zheng, Zinaida Good, Olivier Gevaert
Main category: cs.LG
TL;DR: SAGE-FM is a lightweight spatial transcriptomics foundation model using graph convolutional networks with masked central spot prediction, trained on 416 human Visium samples across 15 organs, achieving strong performance in gene recovery, clustering, and downstream tasks.
Details
Motivation: Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. Current methods may lack spatial awareness or biological interpretability.
Method: SAGE-FM uses graph convolutional networks (GCNs) trained with a masked central spot prediction objective. The model is lightweight and parameter-efficient, trained on 416 human Visium samples spanning 15 organs.
Result: SAGE-FM learns spatially coherent embeddings that robustly recover masked genes (91% with significant correlations p<0.05). It outperforms MOFA and existing methods in unsupervised clustering and biological heterogeneity preservation. Achieves 81% accuracy in pathologist-defined spot annotation and improves glioblastoma subtype prediction. In silico perturbations show capture of directional ligand-receptor and regulatory effects.
Conclusion: Simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics, demonstrating strong performance across multiple tasks.
Abstract: Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.
[307] Machine learning-enhanced non-amnestic Alzheimer’s disease diagnosis from MRI and clinical features
Megan A. Witherow, Michael L. Evans, Ahmed Temtam, Hamid Okhravi, Khan M. Iftekharuddin
Main category: cs.LG
TL;DR: ML approach improves diagnosis of atypical Alzheimer’s disease using clinical tests and MRI features, outperforming hippocampal volume alone.
Details
Motivation: Atypical Alzheimer's disease (atAD) patients are often misdiagnosed in clinical settings because standard diagnostic methods (cognitive tests and hippocampal atrophy evaluation) work well for typical AD but fail for atypical presentations.
Method: Developed machine learning approach using clinical testing battery and MRI data from 1410 subjects across four groups. Used multiple classification experiments comparing clinical features, hippocampal volume, and comprehensive MRI features from across the brain. Applied Boruta statistical approach to identify significant brain regions.
Result: Best performance achieved by incorporating additional important MRI features beyond hippocampal volume alone. Improved recall of atAD diagnosis from 52% to 69% for NACC data and from 34% to 77% for ADNI data while maintaining high precision.
Conclusion: The proposed ML approach significantly improves diagnostic accuracy for atypical Alzheimer’s disease using only standard clinical testing and MRI data, with important implications for clinical practice.
Abstract: Alzheimer’s disease (AD), defined as an abnormal buildup of amyloid plaques and tau tangles in the brain, can be diagnosed with high accuracy based on protein biomarkers via PET or CSF analysis. However, due to the invasive nature of biomarker collection, most AD diagnoses are made in memory clinics using cognitive tests and evaluation of hippocampal atrophy based on MRI. While clinical assessment and hippocampal volume show high diagnostic accuracy for amnestic or typical AD (tAD), a substantial subgroup of AD patients with atypical presentation (atAD) are routinely misdiagnosed. To improve diagnosis of atAD patients, we propose a machine learning approach to distinguish between atAD and non-AD cognitive impairment using a clinical testing battery and MRI data collected as standard-of-care. We develop and evaluate our approach using 1410 subjects across four groups (273 tAD, 184 atAD, 235 non-AD, and 685 cognitively normal) collected from one private data set and two public data sets from the National Alzheimer’s Coordinating Center (NACC) and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). We perform multiple atAD vs. non-AD classification experiments using clinical features and hippocampal volume as well as a comprehensive set of MRI features from across the brain. The best performance is achieved by incorporating additional important MRI features, which outperforms using hippocampal volume alone. Furthermore, we use the Boruta statistical approach to identify and visualize significant brain regions distinguishing between diagnostic groups. Our ML approach improves the percentage of correctly diagnosed atAD cases (the recall) from 52% to 69% for NACC and from 34% to 77% for ADNI, while achieving high precision. The proposed approach has important implications for improving diagnostic accuracy for non-amnestic atAD in clinical settings using only a clinical testing battery and MRI.
[308] QUAIL: Quantization Aware Unlearning for Mitigating Misinformation in LLMs
Himanshu Mishra, Kanwal Mehreen
Main category: cs.LG
TL;DR: Quantization can restore forgotten information in machine unlearning; proposed quantization-aware unlearning with logits hinge loss preserves forgetting under 4-bit quantization.
Details
Motivation: Machine unlearning aims to remove specific knowledge from trained models, but quantization (e.g., 4-bit) used for deployment can catastrophically restore forgotten information, undermining the purpose of unlearning.
Method: Analyzed weight-change statistics and bucket overlaps in quantization to show typical unlearning updates are too small to cross quantization thresholds. Introduced logits space hinge loss that forces output logits of unlearned model to differ from original model by at least half the quantization step for forget examples.
Result: Method preserves forgetting under 4-bit quantization on language and classification tasks (including Twitter misinformation dataset), while existing methods almost entirely recover forgotten knowledge after quantization.
Conclusion: Quantization-aware unlearning with logits hinge loss effectively maintains forgetting capability under low-bit quantization, addressing the critical issue of information restoration during model deployment.
Abstract: Machine unlearning aims to remove specific knowledge (e.g., copyrighted or private data) from a trained model without full retraining. In practice, models are often quantized (e.g., 4-bit) for deployment, but we find that quantization can catastrophically restore forgotten information [1]. In this paper, we (1) analyze why low-bit quantization undermines unlearning, and (2) propose a quantization-aware unlearning method to mitigate this. We first compute weight-change statistics and bucket overlaps in quantization to show that typical unlearning updates are too small to cross quantization thresholds. Building on this insight, we introduce a logits space hinge loss: for each forget example, we force the output logits of the unlearned model to differ from the original model by at least a margin (half the quantization step). This ensures forgotten examples remain distinguishable even after quantization. We evaluate on language and classification tasks (including a Twitter misinformation dataset) and show our method preserves forgetting under 4-bit quantization, whereas existing methods almost entirely recover the forgotten knowledge.
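Two points in this abstract lend themselves to a small numeric sketch: why a small weight update can vanish under quantization, and what a margin-based hinge penalty on logits looks like. The code below assumes uniform round-to-nearest quantization and a per-logit absolute-difference hinge averaged over logits. Both are simplifying assumptions, and the function names and values are hypothetical rather than the paper's implementation.

```python
def quantize(value, step):
    """Uniform round-to-nearest quantization onto a grid with spacing `step`."""
    return round(value / step) * step


def logit_hinge_loss(unlearned_logits, original_logits, margin):
    """Hinge penalty that reaches zero once every logit has moved by `margin`.

    Assumed form: mean over logits of max(0, margin - |z_unlearned - z_original|).
    """
    gaps = [abs(u - o) for u, o in zip(unlearned_logits, original_logits)]
    return sum(max(0.0, margin - g) for g in gaps) / len(gaps)


step = 0.25            # illustrative grid spacing, not a real 4-bit scale
w_original = 1.02
w_small_update = 1.07  # update of 0.05 stays in the same bucket
w_large_update = 1.22  # update of 0.20 crosses into the next bucket

same_bucket = quantize(w_original, step) == quantize(w_small_update, step)
new_bucket = quantize(w_original, step) != quantize(w_large_update, step)
print(same_bucket, new_bucket)

# Margin set to half the step, following the abstract's description.
loss = logit_hinge_loss([1.0, 2.0], [1.05, 2.5], margin=step / 2)
print(loss)
```

When the small update lands in the same bucket, the quantized model is identical to the original for that weight, which is one way to picture how "forgotten" behaviour can reappear after quantization.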
[309] PRISM: Deriving the Transformer as a Signal-Denoising Operator via Maximum Coding Rate Reduction
Dongchen Huang
Main category: cs.LG
TL;DR: Prism is a white-box attention architecture derived from MCR² principles that uses geometric constraints to induce unsupervised functional disentanglement, showing that interpretability and performance can be unified rather than traded off.
Details
Motivation: Transformers are criticized as "black boxes" lacking interpretability. The authors aim to create a more interpretable attention-based architecture that doesn't sacrifice performance.
Method: Propose Prism architecture based on Maximizing Coding Rate Reduction (MCR²) principles. Model attention as gradient ascent on signal-noise manifold. Introduce two physical constraints: 1) overcomplete dictionary for expanded representational phase space, 2) irrational frequency separation (π-RoPE) to enforce incoherence between signal and noise subspaces.
Result: On TinyStories testbed, Prism spontaneously specializes attention heads into spectrally distinct regimes: low-frequency heads capture long-range causal dependencies (signal) while high-frequency heads handle local syntactic constraints (noise).
Conclusion: Interpretability and performance are not a trade-off but can be unified through principled geometric construction. Geometric inductive biases alone can induce unsupervised functional disentanglement.
Abstract: Deep learning models, particularly Transformers, are often criticized as “black boxes” that lack interpretability. We propose Prism, a white-box attention-based architecture derived from the principles of Maximizing Coding Rate Reduction ($\text{MCR}^2$). By modeling the attention mechanism as a gradient ascent process on a distinct signal-noise manifold, we introduce two physical constraints: an overcomplete dictionary to expand the representational phase space, and an irrational frequency separation ($π$-RoPE) to enforce incoherence between signal and noise subspaces. We demonstrate that these geometric inductive biases can be viewed as physical constraints and that they alone are sufficient to induce unsupervised functional disentanglement. Using TinyStories as a controlled testbed for verifying spectral dynamics, we observe that Prism spontaneously specializes its attention heads into spectrally distinct regimes: low-frequency heads capturing long-range causal dependencies (signal) and high-frequency heads handling local syntactic constraints (noise). Our results suggest that interpretability and performance are not a trade-off, but can be unified through principled geometric construction.
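For context, the coding rate reduction objective as it is usually written in the MCR² literature is reproduced below. This is the general formulation, assumed here for orientation only; the exact objective that Prism optimizes, and how its signal and noise subspaces enter it, may differ. The symbols are $Z \in \mathbb{R}^{d \times m}$ for $m$ feature vectors of dimension $d$, $\epsilon$ for the allowed distortion, and diagonal membership matrices $\Pi_j$ for $k$ groups.

```latex
% Coding rate of the whole feature set:
R(Z, \epsilon) = \frac{1}{2} \log\det\!\left( I + \frac{d}{m\,\epsilon^{2}}\, Z Z^{\top} \right)

% Average coding rate of the groups defined by \Pi = \{\Pi_j\}_{j=1}^{k}:
R_c(Z, \epsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2m}
  \log\det\!\left( I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}}\, Z \Pi_j Z^{\top} \right)

% Rate reduction, maximized over the representation Z:
\Delta R(Z, \Pi, \epsilon) = R(Z, \epsilon) - R_c(Z, \epsilon \mid \Pi)
```

Maximizing $\Delta R$ expands the volume spanned by all features while compressing each group, which is the sense in which a gradient ascent step on it can be read as a denoising update.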
[310] RDumb++: Drift-Aware Continual Test-Time Adaptation
Himanshu Mishra
Main category: cs.LG
TL;DR: RDumb++ extends RDumb with entropy and KL-divergence drift detection mechanisms and adaptive reset strategies to prevent prediction collapse in long-horizon Continual Test-Time Adaptation, achieving ~3% accuracy gains on CCC benchmark.
Details
Motivation: Existing CTTA methods like Tent and EATA struggle with rapidly changing or long-horizon test distribution shifts, particularly in benchmarks like CCC with 7.5M samples and continually evolving corruptions. There's a need for methods that can detect harmful adaptation and recover before prediction collapse occurs.
Method: RDumb++ introduces two drift-detection mechanisms: entropy-based drift scoring and KL-divergence drift scoring, combined with adaptive reset strategies. These mechanisms allow the model to detect when accumulated adaptation becomes harmful and trigger recovery before prediction collapse.
Result: On CCC-medium benchmark with three speeds and three seeds (nine runs, each containing one million samples), RDumb++ consistently surpasses RDumb, achieving approximately 3% absolute accuracy gains while maintaining stable adaptation throughout the entire stream.
Conclusion: Drift-aware resetting is essential for preventing collapse and achieving reliable long-horizon CTTA. RDumb++ demonstrates that principled drift detection and adaptive reset strategies enable stable adaptation over extremely long data streams with continually changing distributions.
Abstract: Continual Test-Time Adaptation (CTTA) seeks to update a pretrained model during deployment using only the incoming, unlabeled data stream. Although prior approaches such as Tent and EATA provide meaningful improvements under short evolving shifts, they struggle when the test distribution changes rapidly or over extremely long horizons. This challenge is exemplified by the CCC benchmark, where models operate over streams of 7.5M samples with continually changing corruption types and severities. We propose RDumb++, a principled extension of RDumb that introduces two drift-detection mechanisms, entropy-based drift scoring and KL-divergence drift scoring, together with adaptive reset strategies. These mechanisms allow the model to detect when accumulated adaptation becomes harmful and to recover before prediction collapse occurs. Across CCC-medium with three speeds and three seeds (nine runs, each containing one million samples), RDumb++ consistently surpasses RDumb, yielding approximately 3% absolute accuracy gains while maintaining stable adaptation throughout the entire stream. Ablation experiments on drift thresholds and reset strengths further show that drift-aware resetting is essential for preventing collapse and achieving reliable long-horizon CTTA.
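A drift-triggered reset of the kind described above might be sketched as follows. The scoring and the reset rule are assumptions made for illustration: the sketch flags a reset when mean prediction entropy drops below a floor (over-confident collapse) or when the batch-mean prediction diverges from a reference distribution by more than a KL ceiling. The thresholds, the reference distribution, and the function names are hypothetical and do not reproduce RDumb++.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with q clipped away from zero for stability."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def should_reset(batch_probs, reference_probs, entropy_floor=0.1, kl_ceiling=1.0):
    """Flag a reset when predictions look collapsed or far from a reference.

    Assumed rule: reset if the mean per-sample entropy falls below
    entropy_floor, or if KL(batch-mean prediction || reference) exceeds
    kl_ceiling.
    """
    n = len(batch_probs)
    mean_entropy = sum(entropy(p) for p in batch_probs) / n
    num_classes = len(reference_probs)
    mean_pred = [sum(p[c] for p in batch_probs) / n for c in range(num_classes)]
    drift = kl_divergence(mean_pred, reference_probs)
    return mean_entropy < entropy_floor or drift > kl_ceiling


reference = [1 / 3, 1 / 3, 1 / 3]  # hypothetical reference class distribution
healthy = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
collapsed = [[0.999, 0.0005, 0.0005], [0.999, 0.0005, 0.0005]]
print(should_reset(healthy, reference), should_reset(collapsed, reference))
```

In a real adaptation loop, a positive flag would restore the model (fully or partially) to its pretrained weights before adaptation continues.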
[311] Beyond validation loss: Clinically-tailored optimization metrics improve a model’s clinical performance
Charles B. Delahunt, Courosh Mehanian, Daniel E. Shea, Matthew P. Horning
Main category: cs.LG
TL;DR: Using clinically-tailored metrics instead of validation loss for model optimization in healthcare ML yields better clinical performance.
Details
Motivation: Traditional ML uses validation loss for optimization, but healthcare ML has distinct clinical requirements that may not align with training loss functions. Clinical requirements can be better captured by tailored metrics.
Method: Conducted two controlled experiments comparing model optimization using clinically-tailored metrics versus validation loss. Used metrics specifically designed to reflect clinical requirements rather than differentiable loss functions.
Result: Clinically-tailored metrics provided superior model optimization compared to validation loss, resulting in better performance on the clinical task.
Conclusion: While requiring extra effort to define and implement, using clinically-relevant metrics for optimization yields models that better meet healthcare ML’s central goal of strong clinical performance.
Abstract: A key task in ML is to optimize models at various stages, e.g. by choosing hyperparameters or picking a stopping point. A traditional ML approach is to use validation loss, i.e. to apply the training loss function on a validation set to guide these optimizations. However, ML for healthcare has a distinct goal from traditional ML: models must perform well relative to specific clinical requirements, rather than relative to the loss function used for training. These clinical requirements can be captured more precisely by tailored metrics. Since many optimization tasks do not require the driving metric to be differentiable, they allow a wider range of options, including the use of metrics tailored to be clinically relevant. In this paper we describe two controlled experiments which show how the use of clinically-tailored metrics provides superior model optimization compared to validation loss, in the sense of better performance on the clinical task. The use of clinically-relevant metrics for optimization entails some extra effort, to define the metrics and to code them into the pipeline. But it can yield models that better meet the central goal of ML for healthcare: strong performance in the clinic.
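As a concrete illustration of selecting a stopping point with a non-differentiable, task-specific metric, the sketch below uses sensitivity at a required specificity. This metric is a stand-in chosen for illustration, not one of the paper's metrics, and the checkpoint names and values are invented.

```python
def sensitivity_at_specificity(scores, labels, min_specificity=0.9):
    """Best sensitivity over thresholds whose specificity meets the floor.

    A sample is predicted positive when its score is >= the threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= min_specificity:
            sensitivity = sum(s >= threshold for s in positives) / len(positives)
            best = max(best, sensitivity)
    return best


def select_checkpoint(checkpoints, key, maximize):
    """Return the checkpoint name with the best value for `key`."""
    chooser = max if maximize else min
    return chooser(checkpoints, key=lambda name: checkpoints[name][key])


# Invented validation results for three checkpoints of one training run.
checkpoints = {
    "epoch_10": {"val_loss": 0.41, "clinical": 0.72},
    "epoch_20": {"val_loss": 0.38, "clinical": 0.70},
    "epoch_30": {"val_loss": 0.40, "clinical": 0.81},
}
by_loss = select_checkpoint(checkpoints, "val_loss", maximize=False)
by_clinical = select_checkpoint(checkpoints, "clinical", maximize=True)
print(by_loss, by_clinical)
```

In this made-up run the two criteria disagree, which is the situation the abstract is concerned with: the checkpoint with the lowest validation loss is not the one that best satisfies the clinical requirement.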
[312] Learning Neural Operators from Partial Observations via Latent Autoregressive Modeling
Jingren Hou, Hong Wang, Pengyu Xu, Chang Gao, Huafeng Liu, Liping Jing
Main category: cs.LG
TL;DR: LANO introduces a novel framework for learning neural operators from partially observed data, addressing supervision gaps and spatial mismatches through mask-to-predict training and physics-aware latent propagation, achieving significant error reduction on PDE tasks.
Details
Motivation: Real-world scientific applications often have incomplete observational data due to sensor limitations, geographic constraints, or measurement costs. Current neural operators assume fully-observed spatial inputs, which severely restricts their applicability in practical scenarios where data is incomplete.
Method: Proposes Latent Autoregressive Neural Operator (LANO) with two key components: (1) mask-to-predict training strategy that creates artificial supervision by strategically masking observed regions, and (2) Physics-Aware Latent Propagator that reconstructs solutions through boundary-first autoregressive generation in latent space. Also introduces POBench-PDE benchmark for evaluating neural operators under partial observation conditions.
Result: LANO achieves state-of-the-art performance with 18-69% relative L2 error reduction across all benchmarks under patch-wise missingness with less than 50% missing rate, including real-world climate prediction. The approach effectively handles scenarios with up to 75% missing rate, bridging the gap between idealized research settings and real-world scientific computing.
Conclusion: The proposed framework successfully addresses the fundamental challenges of learning neural operators from partial observations, providing a systematic solution that significantly advances the practical applicability of neural operators in real-world scientific applications with incomplete data.
Abstract: Real-world scientific applications frequently encounter incomplete observational data due to sensor limitations, geographic constraints, or measurement costs. Although neural operators significantly advanced PDE solving in terms of computational efficiency and accuracy, their underlying assumption of fully-observed spatial inputs severely restricts applicability in real-world applications. We introduce the first systematic framework for learning neural operators from partial observation. We identify and formalize two fundamental obstacles: (i) the supervision gap in unobserved regions that prevents effective learning of physical correlations, and (ii) the dynamic spatial mismatch between incomplete inputs and complete solution fields. Specifically, our proposed Latent Autoregressive Neural Operator (LANO) introduces two novel components designed explicitly to address the core difficulties of partial observations: (i) a mask-to-predict training strategy that creates artificial supervision by strategically masking observed regions, and (ii) a Physics-Aware Latent Propagator that reconstructs solutions through boundary-first autoregressive generation in latent space. Additionally, we develop POBench-PDE, a dedicated and comprehensive benchmark designed specifically for evaluating neural operators under partial observation conditions across three PDE-governed tasks. LANO achieves state-of-the-art performance with 18–69% relative L2 error reduction across all benchmarks under patch-wise missingness with less than 50% missing rate, including real-world climate prediction. Our approach effectively addresses practical scenarios involving up to 75% missing rate, to some extent bridging the existing gap between idealized research settings and the complexities of real-world scientific computing.
[313] BanditLP: Large-Scale Stochastic Optimization for Personalized Recommendations
Phuc Nguyen, Benjamin Zelditch, Joyce Chen, Rohit Patra, Changshuai Wei
Main category: cs.LG
TL;DR: BanditLP combines neural Thompson Sampling with large-scale linear programming for multi-stakeholder contextual bandits with constraints, achieving business wins at LinkedIn.
Details
Motivation: Need a scalable framework for multi-stakeholder contextual bandits that can handle exploration (Thompson Sampling) and constrained action selection at web scale, particularly for applications like LinkedIn's email marketing.
Method: Unifies neural Thompson Sampling for learning objective-specific outcomes with a large-scale linear program for constrained action selection. The LP solver handles billions of variables, making it application-agnostic and compatible with arbitrary neural architectures.
Result: Experiments on public benchmarks and synthetic data show consistent gains over strong baselines. Application in LinkedIn’s email marketing system demonstrates business wins, illustrating the value of integrated exploration and constrained optimization in production.
Conclusion: BanditLP provides a scalable, practical framework for multi-stakeholder contextual bandits that successfully combines neural exploration with constrained optimization, delivering real-world business value at web scale.
Abstract: We present BanditLP, a scalable multi-stakeholder contextual bandit framework that unifies neural Thompson Sampling for learning objective-specific outcomes with a large-scale linear program for constrained action selection at serving time. The methodology is application-agnostic, compatible with arbitrary neural architectures, and deployable at web scale, with an LP solver capable of handling billions of variables. Experiments on public benchmarks and synthetic data show consistent gains over strong baselines. We apply this approach in LinkedIn’s email marketing system and demonstrate business wins, illustrating the value of integrated exploration and constrained optimization in production.
[314] Deep Learning for Perishable Inventory Systems with Human Knowledge
Xuan Liao, Zhenkang Peng, Ying Rong
Main category: cs.LG
TL;DR: Deep learning-based policies for perishable inventory management with unknown demand and lead times, using marginal cost accounting and structure-guided approaches to improve learning efficiency.
Details
Motivation: Managing perishable products with limited lifetimes is challenging due to stockouts or waste. The problem is compounded when both demand process and lead time distribution are unknown, and only limited historical data with covariates is available.
Method: Adopt marginal cost accounting scheme assigning each order a single lifetime cost for unified loss function. Develop two end-to-end deep learning variants: E2E-BB (black-box) and E2E-PIL (structure-guided with projected inventory level policy). Enhance E2E-PIL with boosting technique (E2E-BPIL) leveraging objective homogeneity.
Result: Experiments show robust performance ordering: E2E-BB dominated by E2E-PIL, which is further improved by E2E-BPIL. Structure-guided approach reduces effective model complexity and improves learning efficiency with modest flexibility loss.
Conclusion: Deep learning-based decision tools are more effective and robust when guided by human knowledge, highlighting the value of integrating advanced analytics with inventory theory.
Abstract: Managing perishable products with limited lifetimes is a fundamental challenge in inventory management, as poor ordering decisions can quickly lead to stockouts or excessive waste. We study a perishable inventory system with random lead times in which both the demand process and the lead time distribution are unknown. We consider a practical setting where orders are placed using limited historical data together with observed covariates and current system states. To improve learning efficiency under limited data, we adopt a marginal cost accounting scheme that assigns each order a single lifetime cost and yields a unified loss function for end-to-end learning. This enables training a deep learning-based policy that maps observed covariates and system states directly to order quantities. We develop two end-to-end variants: a purely black-box approach that outputs order quantities directly (E2E-BB), and a structure-guided approach that embeds the projected inventory level (PIL) policy, capturing inventory effects through explicit computation rather than additional learning (E2E-PIL). We further show that the objective induced by E2E-PIL is homogeneous of degree one, enabling a boosting technique from operational data analytics (ODA) that yields an enhanced policy (E2E-BPIL). Experiments on synthetic and real data establish a robust performance ordering: E2E-BB is dominated by E2E-PIL, which is further improved by E2E-BPIL. Using an excess-risk decomposition, we show that embedding heuristic policy structure reduces effective model complexity and improves learning efficiency with only a modest loss of flexibility. More broadly, our results suggest that deep learning-based decision tools are more effective and robust when guided by human knowledge, highlighting the value of integrating advanced analytics with inventory theory.
[315] Neural Nonlinear Shrinkage of Covariance Matrices for Minimum Variance Portfolio Optimization
Liusha Yang, Siqi Zhao, Shuqi Chai
Main category: cs.LG
TL;DR: Neural network-based nonlinear shrinkage estimator for covariance matrices that improves minimum variance portfolio optimization by combining statistical estimation with machine learning.
Details
Motivation: To develop a better covariance matrix estimator for portfolio optimization that directly targets risk minimization, moving beyond traditional linear shrinkage methods like Ledoit-Wolf.
Method: Hybrid approach starting from Ledoit-Wolf estimator, decomposing it into eigenvalues/eigenvectors, then applying a lightweight transformer-based neural network to learn nonlinear eigenvalue shrinkage function. Trained with portfolio risk as loss function, conditioned on sample-to-dimension ratio for scalability.
Result: Empirical results on S&P500 daily returns show consistently lower out-of-sample realized risk compared to benchmark approaches.
Conclusion: The method demonstrates promise of integrating structural statistical models with data-driven learning for improved portfolio risk management.
Abstract: This paper introduces a neural network-based nonlinear shrinkage estimator of covariance matrices for the purpose of minimum variance portfolio optimization. It is a hybrid approach that integrates statistical estimation with machine learning. Starting from the Ledoit-Wolf (LW) shrinkage estimator, we decompose the LW covariance matrix into its eigenvalues and eigenvectors, and apply a lightweight transformer-based neural network to learn a nonlinear eigenvalue shrinkage function. Trained with portfolio risk as the loss function, the resulting precision matrix (the inverse covariance matrix) estimator directly targets portfolio risk minimization. By conditioning on the sample-to-dimension ratio, the approach remains scalable across different sample sizes and asset universes. Empirical results on stock daily returns from Standard & Poor’s 500 Index (S&P500) demonstrate that the proposed method consistently achieves lower out-of-sample realized risk than benchmark approaches. This highlights the promise of integrating structural statistical models with data-driven learning.
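The eigenvalue-shrinkage pipeline described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: `pull_toward_mean` is a hand-written placeholder standing in for the learned transformer-based shrinkage function, the input is a plain covariance matrix rather than a Ledoit-Wolf estimate, and all function names are hypothetical.

```python
import numpy as np


def min_variance_weights(cov):
    """Global minimum variance weights: w = inv(cov) @ 1 / (1' @ inv(cov) @ 1)."""
    ones = np.ones(cov.shape[0])
    x = np.linalg.solve(cov, ones)  # inv(cov) @ 1 without forming the inverse
    return x / x.sum()


def shrink_eigenvalues(cov, shrink_fn):
    """Keep the eigenvectors of cov and replace its eigenvalues by shrink_fn(eigenvalues)."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is assumed symmetric
    new_vals = shrink_fn(eigvals)
    return (eigvecs * new_vals) @ eigvecs.T


def pull_toward_mean(eigvals, alpha=0.5):
    """Placeholder shrinkage (an assumption): move each eigenvalue toward the average."""
    return (1.0 - alpha) * eigvals + alpha * eigvals.mean()


if __name__ == "__main__":
    cov = np.array([[0.04, 0.01], [0.01, 0.09]])  # hypothetical two-asset covariance
    shrunk = shrink_eigenvalues(cov, pull_toward_mean)
    print(min_variance_weights(cov))
    print(min_variance_weights(shrunk))
```

In the paper the shrinkage function is trained end to end with realized portfolio risk as the loss; here it is fixed by hand only to show where such a function plugs in.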
[316] When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards
Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou
Main category: cs.LG
TL;DR: RLVR (Reinforcement Learning with Verifiable Rewards) helps LLMs solve logic problems, but may cause “over-sharpening” where models collapse onto limited solution modes instead of exploring alternatives. The paper proposes calibration methods to improve generalization.
Details
Motivation: While RLVR is empirically successful for LLMs in logic-heavy domains, it's unclear whether it truly elicits novel capabilities or just sharpens existing knowledge distribution. The paper aims to understand if RLVR causes "over-sharpening" - where policies collapse onto limited modes and suppress valid alternatives.
Method: The paper formalizes over-sharpening phenomenon and discovers that finite-batch updates intrinsically bias learning toward sampled modes. To mitigate this, they propose two calibration strategies: 1) inverse-success advantage calibration to prioritize difficult queries, and 2) distribution-level calibration to diversify sampling via a memory network.
Result: Empirical evaluations validate that the proposed strategies can effectively improve generalization by preventing policy collapse and promoting exploration of alternative solutions.
Conclusion: RLVR can suffer from over-sharpening that limits exploration of valid alternatives. The proposed calibration methods help mitigate this issue and improve generalization in logic-heavy problem solving with LLMs.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies can effectively improve generalization.
[317] Closing the Gap on the Sample Complexity of 1-Identification
Zitian Li, Wang Chi Cheung
Main category: cs.LG
TL;DR: This paper addresses the 1-identification problem in multi-armed bandits, developing new lower bounds and algorithms for identifying whether any arm meets a reward threshold, with tight theoretical guarantees.
Details
Motivation: The motivation is to solve the fundamental 1-identification problem in pure exploration bandits, where an agent needs to determine if any arm has mean reward ≥ threshold μ₀. Previous literature left open the analysis of expected total pulling times when multiple qualified arms exist.
Method: The authors use an optimization formulation to derive new lower bounds for expected pulling times when qualified arms exist. They also design a new algorithm whose upper bounds match the lower bounds up to polylogarithmic factors.
Result: The paper provides: (1) A new lower bound for expected total pulling times when at least one qualified arm exists, and (2) A new algorithm with tight upper bounds where the gap to lower bounds is only polynomial in logarithm factors across all problem instances.
Conclusion: This work successfully addresses the open problem of analyzing expected total pulling times when multiple qualified arms exist in 1-identification, providing both theoretical lower bounds and practical algorithms with near-optimal performance guarantees.
Abstract: 1-identification is a fundamental multi-armed bandit formulation of pure exploration. An agent aims to determine whether there exists a qualified arm whose mean reward is not less than a known threshold $\mu_0$, or to output \textsf{None} if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least $1-\delta$, while making the expected total number of pulls $\mathbb{E}\tau$ as small as possible. We work on 1-identification with two main contributions. (1) We utilize an optimization formulation to derive a new lower bound on $\mathbb{E}\tau$ when there is at least one qualified arm. (2) We design a new algorithm and derive tight upper bounds whose gap to the lower bounds is at most polynomial in logarithmic factors across all problem instances. Our result complements the analysis of $\mathbb{E}\tau$ when there are multiple qualified arms, which is an open problem left by the prior literature.
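One way to write down the problem stated in the abstract is given below. The symbols $K$ (number of arms), $\mu_a$ (mean of arm $a$), and $\hat{a}$ (the agent's output) are notation introduced here for illustration and are not taken from the paper, whose exact formalization may differ.

```latex
\begin{align*}
&\text{Output: } \hat{a} \in \{1,\dots,K\} \cup \{\textsf{None}\}, \text{ produced at stopping time } \tau. \\
&\text{Correct output: }
\begin{cases}
\mu_{\hat{a}} \ge \mu_0 & \text{if } \max_{a} \mu_a \ge \mu_0, \\
\hat{a} = \textsf{None} & \text{if } \max_{a} \mu_a < \mu_0.
\end{cases} \\
&\text{Goal: minimize } \mathbb{E}\tau \quad \text{subject to} \quad \Pr(\text{output is correct}) \ge 1-\delta .
\end{align*}
```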
[318] Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors
Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong
Main category: cs.LG
TL;DR: Fission-GRPO improves LLM error recovery in multi-turn tool execution by converting execution errors into corrective supervision during RL training, using an Error Simulator to generate diagnostic feedback for failed trajectories.
Details
Motivation: Current LLMs are brittle in multi-turn tool execution - they fail to interpret error feedback and self-correct after tool call errors. Standard RL treats errors as sparse negative rewards without recovery guidance, and synthetic error-correction datasets suffer from distribution mismatch with models' actual error modes.
Method: Fission-GRPO converts execution errors into corrective supervision within RL training loop. It fissions failed trajectories into new training instances by augmenting them with diagnostic feedback from a finetuned Error Simulator, then resamples recovery rollouts on-policy. This enables learning from precise errors made during exploration rather than static pre-collected cases.
Result: On BFCL v4 Multi-Turn benchmark, Fission-GRPO improves error recovery rate of Qwen3-8B by 5.7% absolute, yielding 4% overall accuracy gain (42.75% to 46.75%) over GRPO and outperforming specialized tool-use agents.
Conclusion: Fission-GRPO effectively addresses LLM brittleness in multi-turn tool execution by enabling models to learn from their own errors during exploration, providing a more robust approach to error recovery that improves overall tool-use performance.
Abstract: Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: following a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret error feedback and self-correct. This brittleness hinders reliable real-world deployment, where execution errors are inevitable during tool interaction. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from distribution mismatch with the model’s on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn benchmark, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute and, crucially, yields a 4% overall accuracy gain (42.75% to 46.75%) over GRPO, outperforming specialized tool-use agents.
[319] An Empirical Study on Ensemble-Based Transfer Learning Bayesian Optimisation with Mixed Variable Types
Natasha Trinkle, Huong Ha, Jeffrey Chan
Main category: cs.LG
TL;DR: Empirical analysis of ensemble-based transfer learning methods for Bayesian optimization, introducing new pipeline components and benchmarks, finding warm start initialization and positive weight constraints improve performance.
Details
Motivation: Bayesian optimization is a sample-efficient method for optimizing expensive black-box objectives; leveraging historical datasets from related problems through transfer learning can improve performance, but requires empirical analysis of different methods and pipeline components.
Method: Empirical analysis of ensemble-based transfer learning Bayesian optimization methods, introducing new pipeline components including: 1) weighting strategy for ensemble surrogate models using regularized regression with positive weight constraints, 2) component for handling cases when transfer learning doesn’t improve performance, and 3) three new real-time transfer learning Bayesian optimization benchmarks.
Result: Two key components consistently improve transfer learning Bayesian optimization performance: warm start initialization and constraining ensemble surrogate model weights to be positive.
Conclusion: Transfer learning can enhance Bayesian optimization performance, with warm start initialization and positive weight constraints being particularly effective strategies for ensemble-based approaches.
Abstract: Bayesian optimisation is a sample-efficient method for finding a global optimum of expensive black-box objective functions. Historic datasets from related problems can be exploited to help improve performance of Bayesian optimisation by adapting transfer learning methods to various components of the Bayesian optimisation pipeline. In this study we perform an empirical analysis of various ensemble-based transfer learning Bayesian optimisation methods and pipeline components. We expand on previous work in the literature by contributing some specific pipeline components, and three new real-time transfer learning Bayesian optimisation benchmarks. In particular we propose to use a weighting strategy for ensemble surrogate model predictions based on regularised regression with weights constrained to be positive, and a related component for handling the case when transfer learning is not improving Bayesian optimisation performance. We find that in general, two components that help improve transfer learning Bayesian optimisation performance are warm start initialisation and constraining the weights used with the ensemble surrogate model to be positive.
[320] Integrating Knowledge Distillation Methods: A Sequential Multi-Stage Framework
Yinxi Tian, Changwu Huang, Ke Tang, Xin Yao
Main category: cs.LG
TL;DR: SMSKD is a sequential multi-stage knowledge distillation framework that integrates heterogeneous KD methods sequentially while using frozen reference models to prevent catastrophic forgetting, with adaptive weighting based on teacher true class probability.
Details
Motivation: Current knowledge distillation methods face challenges when integrating multiple approaches: complex implementation, inflexible combinations, and catastrophic forgetting that limits practical effectiveness despite the promise of combining different knowledge sources.
Method: SMSKD sequentially trains students with different distillation methods at each stage, using frozen reference models from previous stages to anchor learned knowledge and prevent forgetting. Includes adaptive weighting mechanism based on teacher true class probability (TCP) to dynamically adjust reference loss per sample.
Result: Extensive experiments show SMSKD consistently improves student accuracy across diverse teacher-student architectures and method combinations, outperforming existing baselines. Supports arbitrary method combinations with negligible computational overhead.
Conclusion: SMSKD is a practical and resource-efficient solution for integrating heterogeneous KD methods, with stage-wise distillation and reference model supervision being primary contributors to performance gains, complemented by TCP-based adaptive weighting.
Abstract: Knowledge distillation (KD) transfers knowledge from large teacher models to compact student models, enabling efficient deployment on resource-constrained devices. While diverse KD methods, including response-based, feature-based, and relation-based approaches, capture different aspects of teacher knowledge, integrating multiple methods or knowledge sources is promising but often hampered by complex implementation, inflexible combinations, and catastrophic forgetting, which limits practical effectiveness. This work proposes SMSKD (Sequential Multi-Stage Knowledge Distillation), a flexible framework that sequentially integrates heterogeneous KD methods. At each stage, the student is trained with a specific distillation method, while a frozen reference model from the previous stage anchors learned knowledge to mitigate forgetting. In addition, we introduce an adaptive weighting mechanism based on the teacher true class probability (TCP) that dynamically adjusts the reference loss per sample to balance knowledge retention and integration. By design, SMSKD supports arbitrary method combinations and stage counts with negligible computational overhead. Extensive experiments show that SMSKD consistently improves student accuracy across diverse teacher-student architectures and method combinations, outperforming existing baselines. Ablation studies confirm that stage-wise distillation and reference model supervision are primary contributors to performance gains, with TCP-based adaptive weighting providing complementary benefits. Overall, SMSKD is a practical and resource-efficient solution for integrating heterogeneous KD methods.
[321] Dualformer: Time-Frequency Dual Domain Learning for Long-term Time Series Forecasting
Jingjing Bai, Yoshinobu Kawahara
Main category: cs.LG
TL;DR: Dualformer is a transformer-based framework for long-term time series forecasting that addresses the low-pass filtering effect by introducing dual-domain modeling with hierarchical frequency sampling and periodicity-aware weighting.
Details
Motivation: Transformer models for time series forecasting suffer from an inherent low-pass filtering effect that progressively attenuates high-frequency information, limiting their ability to capture fine-grained temporal variations.
Method: Dualformer introduces: 1) dual-branch architecture for concurrent time and frequency domain modeling, 2) hierarchical frequency sampling that allocates distinct frequency bands to different layers, and 3) periodicity-aware weighting mechanism that dynamically balances contributions from dual branches.
Result: Extensive experiments on eight benchmarks demonstrate Dualformer’s robustness and superior performance, particularly on heterogeneous or weakly periodic data, effectively preserving high-frequency information and enhancing generalization.
Conclusion: Dualformer provides a principled dual-domain framework that enables structured frequency modeling and adaptive integration of time-frequency features, addressing the limitations of transformer models in long-term time series forecasting.
Abstract: Transformer-based models, despite their promise for long-term time series forecasting (LTSF), suffer from an inherent low-pass filtering effect that limits their effectiveness. This issue arises due to undifferentiated propagation of frequency components across layers, causing a progressive attenuation of high-frequency information crucial for capturing fine-grained temporal variations. To address this limitation, we propose Dualformer, a principled dual-domain framework that rethinks frequency modeling from a layer-wise perspective. Dualformer introduces three key components: (1) a dual-branch architecture that concurrently models complementary temporal patterns in both time and frequency domains; (2) a hierarchical frequency sampling module that allocates distinct frequency bands to different layers, preserving high-frequency details in lower layers while modeling low-frequency trends in deeper layers; and (3) a periodicity-aware weighting mechanism that dynamically balances contributions from the dual branches based on the harmonic energy ratio of inputs, supported theoretically by a derived lower bound. This design enables structured frequency modeling and adaptive integration of time-frequency features, effectively preserving high-frequency information and enhancing generalization. Extensive experiments conducted on eight widely used benchmarks demonstrate Dualformer’s robustness and superior performance, particularly on heterogeneous or weakly periodic data. Our code is publicly available at https://github.com/Akira-221/Dualformer.
[322] Beyond Hard Writes and Rigid Preservation: Soft Recursive Least-Squares for Lifelong LLM Editing
Xinyu Wang, Sicheng Lyu, Yu Gu, Jerry Huang, Peng Lu, Yufei Cui, Xiao-Wen Chang
Main category: cs.LG
TL;DR: RLSEdit is a recursive least-squares editor for long sequential model editing that addresses the plasticity-stability dilemma by formulating editing as online quadratic optimization with soft constraints, enabling stable scaling to 10K edits while preserving both edit adherence and general capabilities.
Details
Motivation: Existing model editors face a plasticity-stability dilemma in long sequential editing: "hard writes" accumulate interference over time, while "hard preservation" methods can overwrite past edits and degrade general capabilities. There's a need for editors that can handle long edit streams while maintaining both edit success and model stability.
Method: RLSEdit formulates editing as an online quadratic optimization with soft constraints, minimizing a cumulative key-value fitting objective with two regularizers: one controlling deviation from pre-trained weights, and another from a designated anchor mapping. The update uses efficient online recursion via the Woodbury identity, with per-edit cost independent of history length.
Result: Experiments show RLSEdit scales stably to 10K edits, outperforming baselines in both edit success and holistic stability. It retains early edits while preserving general capabilities on GLUE and held-out reasoning/code benchmarks, with per-edit cost scaling only with current edit size.
Conclusion: RLSEdit provides an effective solution for long sequential model editing, balancing plasticity and stability through its recursive least-squares formulation with soft constraints, enabling practical deployment of model editors in real-world scenarios with continuous edit streams.
Abstract: Model editing updates a pre-trained LLM with new facts or rules without re-training, while preserving unrelated behavior. In real deployment, edits arrive as long streams, and existing editors often face a plasticity-stability dilemma: locate-then-edit “hard writes” can accumulate interference over time, while null-space-style “hard preservation” preserves only what is explicitly constrained, so past edits can be overwritten and unconstrained behaviors may deviate, degrading general capabilities in the many-edits regime. We propose RLSEdit, a recursive least-squares editor for long sequential editing. RLSEdit formulates editing as an online quadratic optimization with soft constraints, minimizing a cumulative key-value fitting objective with two regularizers that control for both deviation from the pre-trained weights and from a designated anchor mapping. The resulting update admits an efficient online recursion via the Woodbury identity, with per-edit cost independent of history length and scaling only with the current edit size. We further provide deviation bounds and an asymptotic characterization of the adherence-preservation trade-off in the many-edits regime. Experiments on multiple model families demonstrate stable scaling to 10K edits, outperforming strong baselines in both edit success and holistic stability – crucially retaining early edits, and preserving general capabilities on GLUE and held-out reasoning/code benchmarks.
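The idea of an online recursion whose per-edit cost does not depend on history length can be illustrated with a textbook recursive least-squares update. This is a simplified sketch, not RLSEdit itself: it assumes one key-value pair per edit (so the Woodbury identity reduces to its rank-one Sherman-Morrison form), keeps only a single regularizer toward the initial weights, and uses hypothetical names throughout.

```python
def rls_init(w0, lam):
    """Start from weights w0 (rows of length d) with P = I / lam, i.e. penalty lam * ||W - W0||^2."""
    d = len(w0[0])
    P = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    return [row[:] for row in w0], P


def rls_update(W, P, k, v):
    """Absorb one edit (key k, target value v) with a rank-one Sherman-Morrison update.

    Cost depends only on the dimensions of W and P, not on how many edits came before.
    """
    d = len(k)
    Pk = [sum(P[i][j] * k[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(k[i] * Pk[i] for i in range(d))
    gain = [x / denom for x in Pk]
    for i, row in enumerate(W):
        residual = v[i] - sum(row[j] * k[j] for j in range(d))
        for j in range(d):
            row[j] += residual * gain[j]
    # P <- P - (P k)(P k)^T / (1 + k^T P k); valid because P is symmetric.
    for i in range(d):
        for j in range(d):
            P[i][j] -= gain[i] * Pk[j]
    return W, P


if __name__ == "__main__":
    W, P = rls_init([[0.5, -0.5]], lam=2.0)  # hypothetical 1x2 weight matrix
    for key, value in [([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0]), ([0.0, 2.0], [3.0])]:
        W, P = rls_update(W, P, key, value)
    print(W)
```

After any number of edits, `W` equals the minimizer of the cumulative squared fitting error plus `lam` times the squared deviation from the initial weights, which is the "soft constraint" flavor of the objective in miniature.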
[323] Even GPT-5.2 Can’t Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs
Ryoma Sato
Main category: cs.LG
TL;DR: The paper introduces Zero-Error Horizon (ZEH) as a metric for evaluating LLMs’ maximum error-free problem-solving range, revealing surprising limitations in state-of-the-art models like GPT-5.2 on simple tasks.
Details
Motivation: Current LLM evaluation focuses on accuracy metrics but doesn't measure the range of error-free performance. The authors want to understand how far LLMs can solve problems without any mistakes, which is crucial for safety-critical applications where even single errors can be catastrophic.
Method: Propose Zero-Error Horizon (ZEH) metric to measure maximum problem size/complexity that LLMs can solve without errors. Evaluate ZEH on state-of-the-art LLMs (GPT-5.2, Qwen2.5) using simple algorithmic tasks like parity computation and parenthesis balancing. Use computational optimizations like tree structures and online softmax to reduce evaluation cost.
Result: Surprisingly, GPT-5.2 fails on simple tasks: cannot compute parity of short string “11000” and cannot determine balance of parentheses “((((())))))”. ZEH correlates with accuracy but reveals different behavioral patterns. Analysis provides insights into emergence of algorithmic capabilities. Computational optimizations achieve up to 10x speedup in ZEH evaluation.
Conclusion: ZEH is a valuable metric for trustworthy LLM evaluation, revealing fundamental limitations even in advanced models. The findings caution against over-reliance on LLMs for safety-critical applications and provide insights into algorithmic capability development. Optimized evaluation methods make ZEH practical for broader use.
Abstract: We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors. While ZEH itself is simple, we demonstrate that evaluating the ZEH of state-of-the-art LLMs yields abundant insights. For example, by evaluating the ZEH of GPT-5.2, we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced. This is surprising given the excellent capabilities of GPT-5.2. The fact that LLMs make mistakes on such simple problems serves as an important lesson when applying LLMs to safety-critical domains. By applying ZEH to Qwen2.5 and conducting detailed analysis, we found that while ZEH correlates with accuracy, the detailed behaviors differ, and ZEH provides clues about the emergence of algorithmic capabilities. Finally, while computing ZEH incurs significant computational cost, we discuss how to mitigate this cost by achieving up to one order of magnitude speedup using tree structures and online softmax.
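The two tasks mentioned above have short reference solutions, and a brute-force horizon check follows directly from the definition. The sketch below is an assumption-laden illustration: it takes ZEH to mean the largest input length up to which a solver is correct on every input, `toy_solver` is a made-up stand-in for a model, and the paper's tree-structure and online-softmax optimizations are not reproduced.

```python
from itertools import product


def parity(bits):
    """Return 1 if the bit string contains an odd number of '1' characters, else 0."""
    return bits.count("1") % 2


def is_balanced(parens):
    """Check that every ')' closes an earlier '(' and nothing is left open."""
    depth = 0
    for ch in parens:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


def zero_error_horizon(solver, reference, alphabet, max_len):
    """Largest n <= max_len such that solver agrees with reference on all inputs of length 1..n."""
    horizon = 0
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if solver(s) != reference(s):
                return horizon
        horizon = n
    return horizon


def toy_solver(bits):
    """Hypothetical imperfect solver: exact up to length 4, then always answers 0."""
    return parity(bits) if len(bits) <= 4 else 0


if __name__ == "__main__":
    print(parity("11000"))
    print(is_balanced("(" * 5 + ")" * 6))
    print(zero_error_horizon(toy_solver, parity, "01", max_len=6))
```

Exhaustive enumeration grows exponentially with input length, which is consistent with the abstract's remark that computing ZEH is costly.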
[324] Communication-efficient Federated Graph Classification via Generative Diffusion Modeling
Xiuling Wang, Xin Huang, Haibo Hu, Jianliang Xu
Main category: cs.LG
TL;DR: CeFGC is a novel federated GNN paradigm that reduces communication to only 3 rounds by using generative diffusion models to capture local graph distributions, addressing high communication overhead and non-IID data challenges in federated GNN training.
Details
Motivation: Federated GNNs face two major challenges: 1) high communication overhead from multiple rounds of parameter exchanges, and 2) non-IID data characteristics across clients that degrade model performance.
Method: CeFGC limits communication to only 3 rounds: 1) Clients train generative diffusion models on local graphs and share them with server, 2) Server redistributes all generative models to clients for synthetic graph generation, 3) Clients train local GNNs on combined real+synthetic graphs and upload weights for server aggregation.
Result: Theoretical analysis shows CeFGC reduces communication to constant 3 rounds. Extensive experiments on real graph datasets demonstrate superior performance against state-of-the-art competitors, especially on non-IID graphs, by aligning local/global objectives and enriching training data with diverse synthetic graphs.
Conclusion: CeFGC provides an efficient solution for federated GNN training over non-IID data by dramatically reducing communication overhead while maintaining model performance through generative diffusion-based data sharing.
Abstract: Graph Neural Networks (GNNs) unlock new ways of learning from graph-structured data, proving highly effective in capturing complex relationships and patterns. Federated GNNs (FGNNs) have emerged as a prominent distributed learning paradigm for training GNNs over decentralized data. However, FGNNs face two significant challenges: high communication overhead from multiple rounds of parameter exchanges and non-IID data characteristics across clients. To address these issues, we introduce CeFGC, a novel FGNN paradigm that facilitates efficient GNN training over non-IID data by limiting communication between the server and clients to three rounds only. The core idea of CeFGC is to leverage generative diffusion models to minimize direct client-server communication. Each client trains a generative diffusion model that captures its local graph distribution and shares this model with the server, which then redistributes it back to all clients. Using these generative models, clients generate synthetic graphs combined with their local graphs to train local GNN models. Finally, clients upload their model weights to the server for aggregation into a global GNN model. We theoretically analyze the I/O complexity of communication volume to show that CeFGC reduces to a constant of three communication rounds only. Extensive experiments on several real graph datasets demonstrate the effectiveness and efficiency of CeFGC against state-of-the-art competitors, reflecting our superior performance on non-IID graphs by aligning local and global model objectives and enriching the training set with diverse graphs.
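The three-round message flow described above can be sketched with toy stand-ins. Everything here is an assumption made for illustration: scalars replace graphs, a (mean, spread) pair replaces the diffusion model, a single averaged number replaces GNN weights, and the function names are hypothetical. Only the round structure follows the summary.

```python
import random
from statistics import mean


def train_generator(local_data):
    """Stand-in for a diffusion model: remembers local mean and spread."""
    mu = mean(local_data)
    spread = max(abs(x - mu) for x in local_data)
    return (mu, spread)


def sample(generator, n, rng):
    """Draw n synthetic points from the toy generator."""
    mu, spread = generator
    return [rng.uniform(mu - spread, mu + spread) for _ in range(n)]


def run_protocol(client_data, n_synthetic=50, seed=0):
    rng = random.Random(seed)
    rounds = 0

    # Round 1: every client uploads its generative model to the server.
    generators = [train_generator(d) for d in client_data]
    rounds += 1

    # Round 2: the server sends all generative models back to every client.
    rounds += 1
    local_weights = []
    for d in client_data:
        synthetic = [x for g in generators for x in sample(g, n_synthetic, rng)]
        # Toy "local model": one weight fitted on real + synthetic data.
        local_weights.append(mean(d + synthetic))

    # Round 3: clients upload weights and the server averages them.
    rounds += 1
    return mean(local_weights), rounds
```

The round counter stays at three regardless of how long each client trains locally, which is the property the summary emphasizes.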
[325] Towards Automated Kernel Generation in the Era of LLMs
Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin
Main category: cs.LG
TL;DR: This survey paper provides a systematic overview of LLM-driven kernel generation and optimization, addressing the fragmentation in this emerging field by organizing approaches, datasets, benchmarks, and outlining future research directions.
Details
Motivation: Kernel engineering is critical for AI system performance but requires expert-level hardware knowledge and is time-consuming/non-scalable. Recent LLM advances offer automation potential, but the field lacks systematic organization and perspective.
Method: The paper conducts a structured survey covering: 1) LLM-based approaches for kernel generation, 2) Agentic optimization workflows that create feedback-driven loops, 3) Systematic compilation of datasets and benchmarks for learning and evaluation.
Result: Provides comprehensive organization of existing approaches in LLM-driven kernel generation, identifies key challenges, and establishes a reference framework for future automated kernel optimization research.
Conclusion: LLMs and agentic systems show promise for automating kernel optimization, but systematic frameworks are needed. The survey addresses fragmentation and provides foundation for next-generation automated kernel optimization, with ongoing tracking via open-source repository.
Abstract: The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models (LLMs) and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well-suited to compress expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Rapid progress has been made in this area. However, the field remains fragmented, lacking a systematic perspective for LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM-based approaches and agentic optimization workflows, and systematically compiling the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open-source GitHub repository at https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation.
[326] Rethinking Drug-Drug Interaction Modeling as Generalizable Relation Learning
Dong Xu, Jiantao Wu, Qihua Pan, Sisi Yuan, Zexuan Zhu, Junkai Ji
Main category: cs.LG
TL;DR: GenRel-DDI is a relation-centric learning framework for drug-drug interaction prediction that learns transferable interaction patterns independent of drug identities, enabling better generalization to unseen drugs compared to molecule-centric approaches.
Details
Motivation: Existing DDI prediction methods perform well on benchmarks but fail to generalize to real-world scenarios where most drug pairs involve unseen drugs and validated interactions are scarce. Current molecule-centric models don't reliably correlate embedding proximity with interaction labels, and scaling model capacity doesn't improve generalization.
Method: GenRel-DDI reformulates DDI prediction as a relation-centric learning problem where interaction representations are learned independently of drug identities. This relation-level abstraction captures transferable interaction patterns that generalize to unseen drugs and novel drug pairs.
Result: Extensive experiments across multiple benchmarks show GenRel-DDI consistently and significantly outperforms state-of-the-art methods, with particularly large gains on strict entity-disjoint evaluations where drugs are completely unseen during training.
Conclusion: Relation learning provides an effective and practical approach for robust DDI prediction, addressing the generalization limitations of molecule-centric models and enabling better performance in realistic deployment scenarios with unseen drugs.
Abstract: Drug-drug interaction (DDI) prediction is central to drug discovery and clinical development, particularly in the context of increasingly prevalent polypharmacy. Although existing computational methods achieve strong performance on standard benchmarks, they often fail to generalize to realistic deployment scenarios, where most candidate drug pairs involve previously unseen drugs and validated interactions are scarce. We demonstrate that proximity in the embedding spaces of prevailing molecule-centric DDI models does not reliably correspond to interaction labels, and that simply scaling up model capacity therefore fails to improve generalization. To address these limitations, we propose GenRel-DDI, a generalizable relation learning framework that reformulates DDI prediction as a relation-centric learning problem, in which interaction representations are learned independently of drug identities. This relation-level abstraction enables the capture of transferable interaction patterns that generalize to unseen drugs and novel drug pairs. Extensive experiments across multiple benchmarks demonstrate that GenRel-DDI consistently and significantly outperforms state-of-the-art methods, with particularly large gains on strict entity-disjoint evaluations, highlighting the effectiveness and practical utility of relation learning for robust DDI prediction. The code is available at https://github.com/SZU-ADDG/GenRel-DDI.
[327] Next Generation Active Learning: Mixture of LLMs in the Loop
Yuanyuan Qi, Xiaohao Yang, Jueqing Lu, Guoxiang Guo, Joanne Enticott, Gang Liu, Lan Du
Main category: cs.LG
TL;DR: A novel active learning framework that uses a mixture of LLMs as annotators instead of humans, with noise mitigation techniques to achieve human-comparable performance using lightweight local LLMs.
Details
Motivation: LLMs are increasingly used in active learning to reduce annotation costs, but their annotation quality often falls short of real-world applicability. There's a need to improve LLM-based annotation robustness while maintaining cost-effectiveness.
Method: Proposes Mixture of LLMs in the Loop Active Learning framework: 1) Replaces human annotators with a Mixture-of-LLMs-based annotation model to aggregate strengths of multiple LLMs, 2) Uses annotation discrepancy to identify unreliable annotations, 3) Employs negative learning to enhance learning effectiveness and mitigate noisy label impact.
Result: Extensive experiments show the framework achieves performance comparable to human annotation, consistently outperforms single-LLM baselines and other LLM-ensemble approaches, and operates fully on local machines using lightweight LLMs.
Conclusion: The proposed framework successfully addresses LLM annotation quality limitations by leveraging multiple LLMs with noise mitigation techniques, achieving human-level performance while enabling practical local deployment with lightweight models.
Abstract: With the rapid advancement and strong generalization capabilities of large language models (LLMs), they have been increasingly incorporated into the active learning pipelines as annotators to reduce annotation costs. However, considering the annotation quality, labels generated by LLMs often fall short of real-world applicability. To address this, we propose a novel active learning framework, Mixture of LLMs in the Loop Active Learning, replacing human annotators with labels generated through a Mixture-of-LLMs-based annotation model, aimed at enhancing LLM-based annotation robustness by aggregating the strengths of multiple LLMs. To further mitigate the impact of the noisy labels, we introduce annotation discrepancy and negative learning to identify the unreliable annotations and enhance learning effectiveness. Extensive experiments demonstrate that our framework achieves performance comparable to human annotation and consistently outperforms single-LLM baselines and other LLM-ensemble-based approaches. Moreover, our framework is built on lightweight LLMs, enabling it to operate fully on local machines in real-world applications.
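The two noise-handling ingredients named above can be sketched in a few lines. The summary does not specify the aggregation rule or the exact loss, so this sketch assumes plain majority voting, a discrepancy score equal to the share of annotators who disagree with the majority, and the standard negative-learning loss -log(1 - p_k) for a label k the sample is believed not to have. All function names are hypothetical.

```python
import math
from collections import Counter


def aggregate(votes):
    """Majority label plus a discrepancy score in [0, 1).

    votes: labels proposed by the individual LLM annotators for one sample.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    discrepancy = 1.0 - top / len(votes)  # 0.0 means full agreement
    return label, discrepancy


def positive_loss(probs, label):
    """Usual cross-entropy: push probability of `label` up."""
    return -math.log(probs[label])


def negative_loss(probs, not_label):
    """Negative learning: push probability of `not_label` down."""
    return -math.log(1.0 - probs[not_label])
```

A training loop could, for example, apply `positive_loss` to low-discrepancy samples and `negative_loss` to high-discrepancy ones; where to place that threshold is a design choice the summary does not state.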
[328] Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models
Fengheng Chu, Jiahao Chen, Yuhong Wang, Jun Wang, Zhihui Fu, Shouling Ji, Songze Li
Main category: cs.LG
TL;DR: GOSV identifies safety-critical attention heads in LLMs through global optimization, revealing separate functional pathways for safety and enabling effective white-box jailbreak attacks.
Details
Motivation: Current safety guardrails in LLMs are fragile against jailbreak attacks, and existing attribution methods fail to capture cooperative interactions between attention heads that jointly contribute to safety mechanisms.
Method: Proposes GOSV (Global Optimization for Safety Vector Extraction) that identifies safety-critical attention heads through global optimization over all heads simultaneously, using two complementary activation repatching strategies: Harmful Patching and Zero Ablation.
Result: Identifies two spatially distinct sets of safety vectors (Malicious Injection Vectors and Safety Suppression Vectors) with low overlap, showing aligned LLMs maintain separate safety pathways. Complete safety breakdown occurs at ~30% head repatching. Developed novel white-box jailbreak method outperforming existing attacks.
Conclusion: GOSV provides effective framework for LLM safety interpretability by revealing cooperative safety mechanisms through global optimization, enabling better understanding and exploitation of safety vulnerabilities.
Abstract: While Large Language Models (LLMs) are aligned to mitigate risks, their safety guardrails remain fragile against jailbreak attacks. This reveals limited understanding of components governing safety. Existing methods rely on local, greedy attribution that assumes independent component contributions. However, they overlook the cooperative interactions between different components in LLMs, such as attention heads, which jointly contribute to safety mechanisms. We propose Global Optimization for Safety Vector Extraction (GOSV), a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We employ two complementary activation repatching strategies: Harmful Patching and Zero Ablation. These strategies identify two spatially distinct sets of safety vectors with consistently low overlap, termed Malicious Injection Vectors and Safety Suppression Vectors, demonstrating that aligned LLMs maintain separate functional pathways for safety purposes. Through systematic analyses, we find that complete safety breakdown occurs when approximately 30% of total heads are repatched across all models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white-box attacks across all test models, providing strong evidence for the effectiveness of the proposed GOSV framework on LLM safety interpretability.
[329] Uncertainty-guided Generation of Dark-field Radiographs
Lina Felsner, Henriette Bast, Tina Dorosti, Florian Schaff, Franz Pfeiffer, Daniela Pfeiffer, Julia Schnabel
Main category: cs.LG
TL;DR: First framework to generate X-ray dark-field images from standard chest X-rays using uncertainty-guided GAN, achieving high fidelity and good generalization.
Details
Motivation: X-ray dark-field radiography provides complementary diagnostic information but has limited data availability, making it challenging to develop robust deep learning models for clinical applications.
Method: Uncertainty-Guided Progressive Generative Adversarial Network that incorporates both aleatoric and epistemic uncertainty to improve interpretability and reliability of generated dark-field images.
Result: High structural fidelity of generated images with consistent improvement of quantitative metrics across stages. Out-of-distribution evaluation confirms good generalization of the proposed model.
Conclusion: Uncertainty-guided generative modeling enables realistic dark-field image synthesis and provides a reliable foundation for future clinical applications in X-ray imaging.
Abstract: X-ray dark-field radiography provides complementary diagnostic information to conventional attenuation imaging by visualizing microstructural tissue changes through small-angle scattering. However, the limited availability of such data poses challenges for developing robust deep learning models. In this work, we present the first framework for generating dark-field images directly from standard attenuation chest X-rays using an Uncertainty-Guided Progressive Generative Adversarial Network. The model incorporates both aleatoric and epistemic uncertainty to improve interpretability and reliability. Experiments demonstrate high structural fidelity of the generated images, with consistent improvement of quantitative metrics across stages. Furthermore, out-of-distribution evaluation confirms that the proposed model generalizes well. Our results indicate that uncertainty-guided generative modeling enables realistic dark-field image synthesis and provides a reliable foundation for future clinical applications.
[330] Why Inference in Large Models Becomes Decomposable After Training
Jidong Jin
Main category: cs.LG
TL;DR: Post-training inference systems in large AI models are structurally decomposable due to localized gradient updates, enabling parallel inference without modifying model functionality.
Details
Motivation: Current inference in large-scale AI models uses dense parameter matrices, leading to unsustainable scaling of inference cost and system complexity. The problem isn't insufficient model capacity, but treating inference systems as monolithic operators while ignoring internal structures formed during learning.
Method: The authors show that gradient updates in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from initialization. They introduce a post-training statistical criterion and structural annealing procedure to remove unsupported dependencies and reveal stable, independent substructures.
Result: The approach establishes a post-training, model-agnostic structural view of inference systems that enables structured, parallel inference without modifying model functionality or interfaces.
Conclusion: Post-training inference systems are inherently decomposable due to localized learning patterns, enabling more efficient parallel inference through structural analysis without changing model behavior.
Abstract: Inference in large-scale AI models is typically performed on dense parameter matrices, leading to inference cost and system complexity that scale unsustainably with model size. This limitation does not arise from insufficient model capacity, but from treating post-training inference systems as monolithic operators while ignoring internal structures formed during learning. We show that gradient update events in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from their initialization distribution after training. As a result, post-training inference systems are structurally non-uniform and inherently decomposable. Based on this observation, we introduce a post-training statistical criterion and a structural annealing procedure that removes unsupported dependencies and reveals stable, independent substructures. This work establishes a post-training, model-agnostic structural view of inference systems and enables structured, parallel inference without modifying model functionality or interfaces.
[331] SoK: Challenges in Tabular Membership Inference Attacks
Cristina Pêra, Tânia Carvalho, Maxime Cordy, Luís Antunes
Main category: cs.LG
TL;DR: MIAs show poor performance on tabular data but can effectively expose single-out records; using different surrogate models improves attack effectiveness.
Details
Motivation: To address unexplored concerns about MIAs on tabular data, evaluate their efficacy across centralized/federated learning, and investigate threats from outsider adversaries in federated settings.
Method: Extensive review and refined taxonomy of MIAs for centralized/federated learning; empirical evaluation using multiple attack strategies and defenses on tabular data; analysis of single-out vulnerability and cross-architecture transferability.
Result: MIAs perform poorly on tabular data compared to previous SOTA; limited attacks still successfully expose most single-outs; different surrogate models enhance attack effectiveness; outsider adversaries pose significant threat in federated learning.
Conclusion: Tabular data shows general resistance to MIAs, but single-outs remain highly vulnerable; attack transferability across architectures and surrogate model diversity improve MIA success; federated learning requires stronger defenses against outsider threats.
Abstract: Membership Inference Attacks (MIAs) are currently a dominant approach for evaluating privacy in machine learning applications. Despite their significance in identifying records belonging to the training dataset, several concerns remain unexplored, particularly with regard to tabular data. In this paper, first, we provide an extensive review and analysis of MIAs considering two main learning paradigms: centralized and federated learning. We extend and refine the taxonomy for both. Second, we demonstrate the efficacy of MIAs in tabular data using several attack strategies, also including defenses. Furthermore, in a federated learning scenario, we consider the threat posed by an outsider adversary, which is often neglected. Third, we demonstrate the high vulnerability of single-outs (records with a unique signature) to MIAs. Lastly, we explore how MIAs transfer across model architectures. Our results point towards a general poor performance of these attacks in tabular data which contrasts with previous state-of-the-art. Notably, even attacks with limited attack performance can still successfully expose a large portion of single-outs. Moreover, our findings suggest that using different surrogate models makes MIAs more effective.
[332] Iterative Amortized Hierarchical VAE
Simon W. Penninga, Ruud J. G. van Sloun
Main category: cs.LG
TL;DR: IA-HVAE combines amortized inference with iterative refinement using decoder gradients, achieving 35x speed-up over traditional HVAE while improving reconstruction quality for inverse problems.
Details
Motivation: To overcome limitations of traditional hierarchical variational autoencoders by creating a more efficient inference scheme that combines the speed of amortized inference with the accuracy of iterative refinement.
Method: Proposes Iterative Amortized Hierarchical VAE with hybrid scheme: initial amortized guess followed by iterative refinement using decoder gradients. Uses linearly separable decoder in transform domain (e.g., Fourier space) to enable real-time applications with high model depths.
Result: Achieves 35x speed-up for iterative inference compared to traditional HVAE. Hybrid approach outperforms fully amortized and fully iterative methods in accuracy and speed respectively. Shows improved reconstruction quality over vanilla HVAE in inverse problems like deblurring and denoising.
Conclusion: IA-HVAE successfully combines amortized and iterative inference, offering significant speed improvements while maintaining or improving reconstruction quality, making it suitable for real-time applications and inverse problem solving.
Abstract: In this paper we propose the Iterative Amortized Hierarchical Variational Autoencoder (IA-HVAE), which expands on amortized inference with a hybrid scheme containing an initial amortized guess and iterative refinement with decoder gradients. We achieve this by creating a linearly separable decoder in a transform domain (e.g. Fourier space), enabling real-time applications with very high model depths. The architectural change leads to a 35x speed-up for iterative inference with respect to the traditional HVAE. We show that our hybrid approach outperforms fully amortized and fully iterative equivalents in accuracy and speed respectively. Moreover, the IA-HVAE shows improved reconstruction quality over a vanilla HVAE in inverse problems such as deblurring and denoising.
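The hybrid scheme (amortized guess, then refinement with decoder gradients) can be illustrated on a deliberately tiny example. This sketch assumes a single scalar latent, a linear toy decoder, a slightly miscalibrated encoder, and a pure reconstruction objective with no prior term or hierarchy. None of these choices come from the paper; they only show the control flow.

```python
def decode(z: float) -> float:
    """Toy decoder: x = 2 * z."""
    return 2.0 * z


def amortized_guess(x: float) -> float:
    """Toy encoder that is fast but slightly off (the exact inverse is 0.5 * x)."""
    return 0.4 * x


def refine(x: float, z: float, steps: int, lr: float = 0.1) -> float:
    """Gradient descent on the squared reconstruction error (decode(z) - x)**2."""
    for _ in range(steps):
        grad = 2.0 * (decode(z) - x) * 2.0  # chain rule through the decoder
        z -= lr * grad
    return z


def hybrid_infer(x: float, steps: int = 10) -> float:
    """One encoder pass for the initial guess, then a few refinement steps."""
    return refine(x, amortized_guess(x), steps)
```

Starting from the encoder output rather than from zero means fewer refinement steps are needed for the same error, which is the intuition behind combining the two inference modes.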
[333] Predicting Healthcare System Visitation Flow by Integrating Hospital Attributes and Population Socioeconomics with Human Mobility Data
Binbin Lin, Lei Zou, Hao Tian, Heng Cai, Yifan Yang, Bing Zhou
Main category: cs.LG
TL;DR: Deep Gravity model best predicts hospital visitation patterns by integrating hospital attributes, population SES, and spatial mobility, revealing distance-dependent factors and demographic variations in healthcare access.
Details
Motivation: Existing research on healthcare visitation patterns often examines hospital attributes, population socioeconomics, and spatial factors in isolation, creating a fragmented understanding. This study aims to integrate these determinants to better predict visitation flows and analyze influencing factors.
Method: Used four years of SafeGraph mobility data and Google Maps Reviews data to train five flow prediction models (Naive Regression, Gradient Boosting, MLPs, Deep Gravity, HGNN) for Houston, Texas. Applied SHAP analysis and Partial Dependence Plot methods to examine combined factor impacts.
Result: Deep Gravity outperformed other models. Hospital capacities, ICU occupancy rates, ratings, and popularity significantly influence visitation patterns with varying effects across travel distances. Short-distance visits driven by convenience, long-distance by hospital ratings. Demographic variations show White-majority areas less sensitive to ratings for short distances, while Asian and higher-educated populations prioritize ratings. Hispanic, Black, under-18, and over-65 populations have more frequent visits.
Conclusion: The integrated approach successfully reveals complex visitation patterns and demographic disparities in healthcare access, providing insights for healthcare planning and policy to address inequities in healthcare utilization.
Abstract: Healthcare visitation patterns are influenced by a complex interplay of hospital attributes, population socioeconomics, and spatial factors. However, existing research often adopts a fragmented approach, examining these determinants in isolation. This study addresses this gap by integrating hospital capacities, occupancy rates, reputation, and popularity with population SES and spatial mobility patterns to predict visitation flows and analyze influencing factors. Utilizing four years of SafeGraph mobility data and user experience data from Google Maps Reviews, five flow prediction models, Naive Regression, Gradient Boosting, Multilayer Perceptrons (MLPs), Deep Gravity, and Heterogeneous Graph Neural Networks (HGNN), were trained and applied to simulate visitation flows in Houston, Texas, U.S. The Shapley additive explanation (SHAP) analysis and the Partial Dependence Plot (PDP) method were employed to examine the combined impacts of different factors on visitation patterns. The findings reveal that Deep Gravity outperformed other models. Hospital capacities, ICU occupancy rates, ratings, and popularity significantly influence visitation patterns, with their effects varying across different travel distances. Short-distance visits are primarily driven by convenience, whereas long-distance visits are influenced by hospital ratings. White-majority areas exhibited lower sensitivity to hospital ratings for short-distance visits, while Asian populations and those with higher education levels prioritized hospital rating in their visitation decisions. SES further influences these patterns, as areas with higher proportions of Hispanic, Black, under-18, and over-65 populations tend to have more frequent hospital visits, potentially reflecting greater healthcare needs or limited access to alternative medical services.
[334] Partially Lazy Gradient Descent for Smoothed Online Learning
Naram Mhaisen, George Iosifidis
Main category: cs.LG
TL;DR: k-lazyGD bridges greedy OGD and lazy GD, achieving optimal dynamic regret in SOCO while allowing laziness proportional to comparator path length.
Details
Motivation: To create a spectrum between reactive (greedy OGD) and stable (lazy GD) updates in online learning, and to understand how laziness affects performance in Smoothed Online Convex Optimization where both hitting and movement costs matter.
Method: Proposes k-lazyGD algorithm that interpolates between Online Gradient Descent (k=1) and lazy GD/dual-averaging (k=T). Uses Follow the Regularized Leader (FTRL) framework for analysis. Creates ensemble of learners with different slack parameters k to adapt to varying comparator dynamics.
Result: Proves k-lazyGD achieves optimal dynamic regret O(√((P_T+1)T)) for any laziness slack k up to Θ(√(T/P_T)), where P_T is comparator path length. Shows laziness doesn’t sacrifice hitting performance. Provides matching lower bound. Ensemble method adapts to be stable when possible and agile when necessary.
Conclusion: Laziness in online learning is possible without performance loss, with allowable laziness formally connected to comparator dynamics. The k-lazyGD framework provides adaptive trade-off between stability and agility in SOCO problems.
Abstract: We introduce $k$-lazyGD, an online learning algorithm that bridges the gap between greedy Online Gradient Descent (OGD, for $k=1$) and lazy GD/dual-averaging (for $k=T$), creating a spectrum between reactive and stable updates. We analyze this spectrum in Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. Our main contribution is establishing that laziness is possible without sacrificing hitting performance: we prove that $k$-lazyGD achieves the optimal dynamic regret $\mathcal{O}(\sqrt{(P_T+1)T})$ for any laziness slack $k$ up to $\Theta(\sqrt{T/P_T})$, where $P_T$ is the comparator path length. This result formally connects the allowable laziness to the comparator’s shifts, showing that $k$-lazyGD can retain the inherently small movements of lazy methods without compromising tracking ability. We base our analysis on the Follow the Regularized Leader (FTRL) framework, and derive a matching lower bound. Since the slack depends on $P_T$, an ensemble of learners with various slacks is used, yielding a method that is provably stable when it can be, and agile when it must be.
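The two endpoints of the spectrum can be compared directly in a small sketch. It covers only the textbook forms of greedy projected OGD and lazy GD (dual averaging) on the interval [-1, 1] with linear losses; the intermediate $k$-lazy update rule is not described in this summary and is not reproduced here. Function names and the example gradients are illustrative.

```python
def project(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Euclidean projection onto the interval [lo, hi]."""
    return max(lo, min(hi, x))


def greedy_ogd(grads, eta):
    """Greedy OGD: step from the previous (already projected) iterate."""
    x, iterates = 0.0, []
    for g in grads:
        iterates.append(x)
        x = project(x - eta * g)
    return iterates


def lazy_gd(grads, eta):
    """Lazy GD / dual averaging: project the accumulated gradient sum."""
    grad_sum, iterates = 0.0, []
    for g in grads:
        iterates.append(project(-eta * grad_sum))
        grad_sum += g
    return iterates


def path_length(iterates):
    """Total movement of the learner, the quantity SOCO charges for."""
    return sum(abs(b - a) for a, b in zip(iterates, iterates[1:]))
```

On the gradient sequence `[-1, -1, -1, -1, 1, 1]` with `eta = 1`, both methods move to the boundary and stay there, but the greedy variant reacts to the first sign flip while the lazy variant remembers the accumulated gradients and stays put, so it moves less.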
[335] Data-Driven Conditional Flexibility Index
Moritz Wedemeyer, Eike Cramer, Alexander Mitsos, Manuel Dahmen
Main category: cs.LG
TL;DR: The paper introduces a Conditional Flexibility Index (CFI) that extends traditional flexibility analysis by learning data-driven, context-aware admissible uncertainty sets using normalizing flows, enabling more relevant flexibility assessment under specific conditions.
Details
Motivation: Traditional flexibility index uses simple geometric sets (like hypercubes) to approximate uncertainty regions without considering available contextual information (forecasts) or learning from historical data, limiting its informativeness for robust scheduling decisions.
Method: Proposes CFI using normalizing flows to learn bijective mapping from Gaussian base distribution to data distribution. Constructs admissible latent uncertainty set as hypersphere in latent space, then maps to data space. Incorporates contextual information to make uncertainty sets conditional on specific conditions.
Result: No general superiority of data-driven over simple sets or conditional over unconditional sets, but both ensure only relevant uncertainty regions are considered. Applied to security-constrained unit commitment, CFI improves scheduling quality by incorporating temporal information.
Conclusion: CFI provides more informative flexibility estimates by learning context-aware admissible uncertainty sets from data, enabling better robust scheduling decisions that consider only relevant uncertainty regions under specific conditions.
Abstract: With the increasing flexibilization of processes, determining robust scheduling decisions has become an important goal. Traditionally, the flexibility index has been used to identify safe operating schedules by approximating the admissible uncertainty region using simple admissible uncertainty sets, such as hypercubes. Presently, available contextual information, such as forecasts, has not been considered to define the admissible uncertainty set when determining the flexibility index. We propose the conditional flexibility index (CFI), which extends the traditional flexibility index in two ways: by learning the parametrized admissible uncertainty set from historical data and by using contextual information to make the admissible uncertainty set conditional. This is achieved using a normalizing flow that learns a bijective mapping from a Gaussian base distribution to the data distribution. The admissible latent uncertainty set is constructed as a hypersphere in the latent space and mapped to the data space. By incorporating contextual information, the CFI provides a more informative estimate of flexibility by defining admissible uncertainty sets in regions that are more likely to be relevant under given conditions. Using an illustrative example, we show that no general statement can be made about data-driven admissible uncertainty sets outperforming simple sets, or conditional sets outperforming unconditional ones. However, both data-driven and conditional admissible uncertainty sets ensure that only regions of the uncertain parameter space containing realizations are considered. We apply the CFI to a security-constrained unit commitment example and demonstrate that the CFI can improve scheduling quality by incorporating temporal information.
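The construction in the abstract (a hypersphere in latent space pushed through a bijection into data space) can be sketched as follows. This is a minimal illustration that replaces the learned normalizing flow with a fixed affine bijection; the matrix `A`, the offset `b`, the radius, and all function names are assumptions for the example, not the authors' model.

```python
import numpy as np

def sample_latent_sphere(radius, n_points, dim, rng):
    """Draw points uniformly from the surface of a latent hypersphere."""
    z = rng.standard_normal((n_points, dim))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return radius * z

def affine_flow_forward(z, A, b):
    """Stand-in bijection from latent z to data x (a real flow would be learned)."""
    return z @ A.T + b

def affine_flow_inverse(x, A, b):
    """Inverse of the stand-in bijection: data x back to latent z."""
    return np.linalg.solve(A, (x - b).T).T

def is_inside_set(x, radius, A, b):
    """A data point is admissible iff its latent image lies within the radius."""
    z = affine_flow_inverse(x, A, b)
    return np.linalg.norm(z, axis=1) <= radius + 1e-9

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.0, 1.0]])   # hypothetical invertible map
b = np.array([1.0, -1.0])                # hypothetical offset
# Boundary of the admissible set in data space: an ellipse for this affine map.
boundary = affine_flow_forward(sample_latent_sphere(1.5, 200, 2, rng), A, b)
```

Because the map is a bijection, membership in the data-space set can be tested in latent space, where the set is a simple ball. A conditional flow would additionally take the context (for example a forecast) as an input to the map.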
[336] CLASP: An online learning algorithm for Convex Losses And Squared Penalties
Ricardo N. Ferreira, Cláudia Soares, João Xavier
Main category: cs.LG
TL;DR: CLASP algorithm for Constrained Online Convex Optimization achieves optimal regret and constraint violation bounds, with logarithmic guarantees for strongly convex problems.
Details
Motivation: Address the challenge of Constrained Online Convex Optimization where learners face both unanticipated convex losses and constraints, needing to minimize cumulative loss while controlling constraint violations.
Method: Introduces CLASP (Convex Losses And Squared Penalties) algorithm that minimizes cumulative loss together with squared constraint violations, leveraging firm non-expansiveness of convex projectors.
Result: For convex losses: regret O(T^{max{β,1-β}}) and cumulative squared penalty O(T^{1-β}) for any β∈(0,1). For strongly convex problems: first logarithmic guarantees with both regret and cumulative squared penalty bounded by O(log T).
Conclusion: CLASP provides state-of-the-art performance for COCO, with novel proof techniques and the first logarithmic guarantees for strongly convex constrained online optimization problems.
Abstract: We study Constrained Online Convex Optimization (COCO), where a learner chooses actions iteratively, observes both unanticipated convex loss and convex constraint, and accumulates loss while incurring penalties for constraint violations. We introduce CLASP (Convex Losses And Squared Penalties), an algorithm that minimizes cumulative loss together with squared constraint violations. Our analysis departs from prior work by fully leveraging the firm non-expansiveness of convex projectors, a proof strategy not previously applied in this setting. For convex losses, CLASP achieves regret $O\left(T^{\max\{β,1-β\}}\right)$ and cumulative squared penalty $O\left(T^{1-β}\right)$ for any $β\in (0,1)$. Most importantly, for strongly convex problems, CLASP provides the first logarithmic guarantees on both regret and cumulative squared penalty. In the strongly convex case, the regret is upper bounded by $O( \log T )$ and the cumulative squared penalty is also upper bounded by $O( \log T )$.
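The trade-off encoded in the convex-case bounds can be read off by tabulating the two exponents. The helper below is a small illustration; the function name is an assumption, and it only evaluates the exponents stated in the abstract rather than implementing CLASP.

```python
def clasp_exponents(beta):
    """Return (regret exponent, squared-penalty exponent) for beta in (0, 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return max(beta, 1.0 - beta), 1.0 - beta

for beta in (0.25, 0.5, 0.75):
    regret_exp, penalty_exp = clasp_exponents(beta)
    print(f"beta={beta}: regret O(T^{regret_exp}), penalty O(T^{penalty_exp})")
```

The regret exponent is smallest at β = 1/2, where both quantities scale as the square root of T. Pushing β above 1/2 lowers the penalty exponent at the price of a larger regret exponent.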
[337] Explainable AI to Improve Machine Learning Reliability for Industrial Cyber-Physical Systems
Annemarie Jutte, Uraz Odyurt
Main category: cs.LG
TL;DR: Using XAI (SHAP values) to analyze time-series decomposition components reveals insufficient contextual information in ML models for industrial CPS, leading to improved performance by increasing data window size.
Details
Motivation: Industrial CPS are critical infrastructure requiring high reliability. ML models integrated into CPS are complex and non-transparent, needing rigorous evaluation to prevent unexpected behavior on unseen data. XAI can uncover model reasoning for better analysis.
Method: Apply XAI (specifically SHAP values) to analyze effects of time-series data decomposition components on model predictions. Use XAI findings to identify lack of contextual information and increase window size of data instances during training.
Result: XAI analysis revealed evidence of insufficient contextual information during model training. By increasing the window size of data instances based on XAI findings, model performance was improved.
Conclusion: XAI can effectively diagnose limitations in ML models for industrial CPS, specifically identifying insufficient contextual information. Using XAI insights to adjust training data (increasing window size) leads to measurable performance improvements, demonstrating the value of explainability for enhancing CPS reliability.
Abstract: Industrial Cyber-Physical Systems (CPS) are sensitive infrastructure from both safety and economics perspectives, making their reliability critically important. Machine Learning (ML), specifically deep learning, is increasingly integrated in industrial CPS, but the inherent complexity of ML models results in non-transparent operation. Rigorous evaluation is needed to prevent models from exhibiting unexpected behaviour on future, unseen data. Explainable AI (XAI) can be used to uncover model reasoning, allowing a more extensive analysis of behaviour. We apply XAI to improve predictive performance of ML models intended for industrial CPS. We analyse the effects of components from time-series data decomposition on model predictions using SHAP values. Through this method, we observe evidence of a lack of sufficient contextual information during model training. By increasing the window size of data instances, informed by the XAI findings, we are able to improve model performance.
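The remedy described above, enlarging the window of each data instance, comes down to how the time series is sliced before training. The sketch below shows only that slicing step with a hypothetical `make_windows` helper; it does not reproduce the paper's decomposition or SHAP analysis.

```python
import numpy as np

def make_windows(series, window_size, stride=1):
    """Slice a 1-D series into overlapping windows of length window_size."""
    series = np.asarray(series)
    if window_size > len(series):
        raise ValueError("window_size exceeds series length")
    starts = range(0, len(series) - window_size + 1, stride)
    return np.stack([series[s:s + window_size] for s in starts])

signal = np.arange(10)
short = make_windows(signal, window_size=3)    # 8 instances, 3 time steps each
longer = make_windows(signal, window_size=6)   # 5 instances, 6 time steps each
```

A larger window gives each instance more temporal context but yields fewer instances from the same recording, which is the trade-off to keep in mind when acting on such a finding.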
[338] Probably Approximately Correct Maximum A Posteriori Inference
Matthew Shorvon, Frederik Mallmann-Trenn, David S. Watson
Main category: cs.LG
TL;DR: PAC algorithms for MAP inference with provable optimality guarantees under computational budgets, using information-theoretic measures and probabilistic circuits.
Details
Motivation: MAP estimation is fundamental but generally intractable, even with structural constraints and approximations. There's a need for algorithms with rigorous guarantees.
Method: Introduce PAC-MAP algorithms using information-theoretic measures estimated from finite samples, implemented via probabilistic circuits with appropriate architectures. Develop randomization strategies for standalone use or to fortify existing heuristics.
Result: Characterize tractability conditions for PAC-MAP, provide provably optimal solutions under variable/fixed computational budgets, and demonstrate benefits across benchmarks.
Conclusion: PAC-MAP offers rigorous guarantees for MAP inference, bridging theoretical optimality with practical implementation through probabilistic circuits and randomization strategies.
Abstract: Computing the conditional mode of a distribution, better known as the $\mathit{maximum\ a\ posteriori}$ (MAP) assignment, is a fundamental task in probabilistic inference. However, MAP estimation is generally intractable, and remains hard even under many common structural constraints and approximation schemes. We introduce $\mathit{probably\ approximately\ correct}$ (PAC) algorithms for MAP inference that provide provably optimal solutions under variable and fixed computational budgets. We characterize tractability conditions for PAC-MAP using information theoretic measures that can be estimated from finite samples. Our PAC-MAP solvers are efficiently implemented using probabilistic circuits with appropriate architectures. The randomization strategies we develop can be used either as standalone MAP inference techniques or to improve on popular heuristics, fortifying their solutions with rigorous guarantees. Experiments confirm the benefits of our method in a range of benchmarks.
[339] Benchmarking Deep Learning Models for Raman Spectroscopy Across Open-Source Datasets
Adithya Sineesh, Akshita Kamsali
Main category: cs.LG
TL;DR: Systematic benchmark of five Raman-specific deep learning classifiers across three open-source datasets with unified training protocols.
Details
Motivation: Existing deep learning evaluations for Raman spectroscopy often lack direct comparisons between Raman-specific models on shared datasets, creating a need for systematic benchmarking.
Method: Evaluated five representative Raman-specific deep learning architectures under unified training and hyperparameter tuning protocols across three open-source Raman datasets for standard evaluation, fine-tuning, and distribution-shift testing.
Result: Reported classification accuracies and macro-averaged F1 scores to provide fair and reproducible comparison of deep learning models for Raman spectral classification.
Conclusion: This study presents one of the first systematic benchmarks comparing multiple published Raman-specific deep learning classifiers, addressing the scarcity of direct comparisons in the field.
Abstract: Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However, their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman spectra-based classification.
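The two reported metrics can diverge on imbalanced spectra collections, which is why both are given. The sketch below computes them from scratch on made-up labels; the function name and the toy labels are assumptions for illustration and are unrelated to the benchmark's data.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 (a class with no true positives scores 0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy labels: class 0 dominates, and the single class-2 sample is misclassified.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1]
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
score = macro_f1(y_true, y_pred, n_classes=3)
```

Here accuracy is 5/6 while macro-averaged F1 is 5/9, because the missed minority class counts as much as the majority class in the macro average.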
[340] Variable Splitting Binary Tree Models Based on Bayesian Context Tree Models for Time Series Segmentation
Yuta Nakahara, Shota Saito, Kohei Horinouchi, Koshi Shimada, Naoki Ichijo, Manabu Kobayashi, Toshiyasu Matsushima
Main category: cs.LG
TL;DR: A Bayesian variable splitting binary tree model for time series segmentation that uses logistic regression to represent split positions at arbitrary locations, enabling more compact tree representations.
Details
Motivation: To develop a more flexible time series segmentation method that can represent split positions at arbitrary locations within intervals, unlike previous Bayesian context tree models that had limitations in representing interval partitioning on the time domain.
Method: Proposes a variable splitting binary tree (VSBT) model based on Bayesian context tree models, where interval partitioning is represented by recursive logistic regression models. Uses local variational approximation for logistic regression combined with the context tree weighting (CTW) algorithm for simultaneous estimation of split positions and tree depth.
Result: Numerical examples on synthetic data demonstrate the effectiveness of the proposed model and inference algorithm in time series segmentation.
Conclusion: The VSBT model provides a flexible approach to time series segmentation with arbitrary split positions and compact tree representations, with an effective inference algorithm for simultaneous parameter estimation.
Abstract: We propose a variable splitting binary tree (VSBT) model based on Bayesian context tree (BCT) models for time series segmentation. Unlike previous applications of BCT models, the tree structure in our model represents interval partitioning on the time domain. Moreover, interval partitioning is represented by recursive logistic regression models. By adjusting logistic regression coefficients, our model can represent split positions at arbitrary locations within each interval. This enables more compact tree representations. For simultaneous estimation of both split positions and tree depth, we develop an effective inference algorithm that combines local variational approximation for logistic regression with the context tree weighting (CTW) algorithm. We present numerical examples on synthetic data demonstrating the effectiveness of our model and algorithm.
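How logistic regression coefficients can place a split anywhere in an interval is easiest to see in one dimension. The sketch below assumes a single node whose routing probability is a logistic function of time; the parameterization, names, and coefficient values are illustrative assumptions rather than the paper's exact model.

```python
import math

def right_child_probability(t, weight, bias):
    """Logistic probability that time point t is routed to the right child."""
    return 1.0 / (1.0 + math.exp(-(weight * t + bias)))

def split_position(weight, bias):
    """Time at which both children are equally likely (probability 0.5)."""
    if weight == 0:
        raise ValueError("weight must be non-zero for a finite split position")
    return -bias / weight

weight, bias = 20.0, -6.0                 # hypothetical coefficients
t_split = split_position(weight, bias)    # split inside the interval [0, 1]
p_left_end = right_child_probability(0.0, weight, bias)
p_right_end = right_child_probability(1.0, weight, bias)
```

Changing the ratio of `bias` to `weight` moves the split continuously, and a larger `weight` makes the transition sharper. A fixed midpoint split would instead need several tree levels to approximate the same boundary.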
[341] On the Intrinsic Dimensions of Data in Kernel Learning
Rustem Takhanov
Main category: cs.LG
TL;DR: The paper analyzes kernel ridge regression generalization using two intrinsic dimension measures: Minkowski dimension d_ρ and effective dimension d_K derived from Kolmogorov n-widths, showing d_K can be smaller than d_ρ on fractal domains.
Details
Motivation: To understand how the intrinsic dimension of input data affects generalization in kernel ridge regression, particularly investigating alternative dimension measures beyond standard Minkowski dimension.
Method: Analyzes relationship between Kolmogorov n-widths and eigenvalues of kernel integral operators; derives generalization bounds; proposes algorithm to estimate n-widths from finite samples; computes effective dimensions for fractal sets.
Result: Shows Kolmogorov n-widths characterize worst-case eigenvalue decay; derives excess error bound O(n^{-(2+d_K)/(2+2d_K)+ε}); proves effective dimension d_K can be significantly smaller than Minkowski dimension d_ρ on fractal domains despite equality on regular domains.
Conclusion: Effective dimension d_K provides better generalization bounds than Minkowski dimension for kernel ridge regression on irregular domains, with practical estimation algorithms using finite samples.
Abstract: The manifold hypothesis suggests that the generalization performance of machine learning methods improves significantly when the intrinsic dimension of the input distribution’s support is low. In the context of kernel ridge regression (KRR), we investigate two alternative notions of intrinsic dimension. The first, denoted $d_ρ$, is the upper Minkowski dimension defined with respect to the canonical metric induced by a kernel function $K$ on a domain $Ω$. The second, denoted $d_K$, is the effective dimension, derived from the decay rate of Kolmogorov $n$-widths associated with $K$ on $Ω$. Given a probability measure $μ$ on $Ω$, we analyze the relationship between these $n$-widths and eigenvalues of the integral operator $φ\mapsto \int_Ω K(\cdot,x)φ(x)\,dμ(x)$. We show that, for a fixed domain $Ω$, the Kolmogorov $n$-widths characterize the worst-case eigenvalue decay across all probability measures $μ$ supported on $Ω$. These eigenvalues are central to understanding the generalization behavior of constrained KRR, enabling us to derive an excess error bound of order $O(n^{-\frac{2+d_K}{2+2d_K} + ε})$ for any $ε > 0$, when the training set size $n$ is large. We also propose an algorithm that estimates upper bounds on the $n$-widths using only a finite sample from $μ$. For distributions close to uniform, we prove that $ε$-accurate upper bounds on all $n$-widths can be computed with high probability using at most $O\left(ε^{-d_ρ}\log\frac{1}{ε}\right)$ samples, with fewer required for small $n$. Finally, we compute the effective dimension $d_K$ for various fractal sets and present additional numerical experiments. Our results show that, for kernels such as the Laplace kernel, the effective dimension $d_K$ can be significantly smaller than the Minkowski dimension $d_ρ$, even though $d_K = d_ρ$ provably holds on regular domains.
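To see why a smaller effective dimension matters, it helps to evaluate the exponent in the excess error bound for a few values of $d_K$. The helper below only evaluates the formula quoted in the abstract; its name and the chosen dimensions are assumptions for illustration.

```python
from fractions import Fraction

def excess_error_exponent(d_k):
    """Return r = (2 + d_K) / (2 + 2 d_K), so the bound reads O(n^(-r + eps))."""
    d = Fraction(d_k)
    if d < 0:
        raise ValueError("the effective dimension must be non-negative")
    return (2 + d) / (2 + 2 * d)

for d_k in (0, 1, 2, 10):
    print(d_k, excess_error_exponent(d_k))
```

The exponent equals 1 at $d_K = 0$ and decreases toward 1/2 as $d_K$ grows, so any reduction in the dimension entering the bound translates directly into a faster rate.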
[342] Beat-SSL: Capturing Local ECG Morphology through Heartbeat-level Contrastive Learning with Soft Targets
Muhammad Ilham Rizqyawan, Peter Macfarlane, Stathis Hadjidemetriou, Fani Deligianni
Main category: cs.LG
TL;DR: Beat-SSL is a contrastive learning framework for ECG analysis that uses dual-context learning with rhythm-level and heartbeat-level contrasting and soft targets, achieving strong performance on multilabel classification and segmentation tasks.
Details
Motivation: Obtaining labeled ECG data is challenging for supervised models. Existing contrastive learning frameworks either focus only on global context or don't exploit ECG-specific characteristics, and they use hard contrastive targets that don't capture the continuous nature of ECG feature similarity.
Method: Beat-SSL performs dual-context contrastive learning through both rhythm-level and heartbeat-level contrasting with soft targets. The framework learns representations across both global and local contexts specifically tailored to ECG characteristics.
Result: Beat-SSL reached 93% of the performance of an ECG foundation model (with broader pretraining) in multilabel classification, and surpassed all other methods including the foundation model by 4% in the segmentation task.
Conclusion: Beat-SSL effectively addresses limitations of existing contrastive learning methods for ECG analysis by incorporating ECG-specific dual-context learning with soft targets, enabling effective transfer learning with limited labeled data.
Abstract: Obtaining labelled ECG data for developing supervised models is challenging. Contrastive learning (CL) has emerged as a promising pretraining approach that enables effective transfer learning with limited labelled data. However, existing CL frameworks either focus solely on global context or fail to exploit ECG-specific characteristics. Furthermore, these methods rely on hard contrastive targets, which may not adequately capture the continuous nature of feature similarity in ECG signals. In this paper, we propose Beat-SSL, a contrastive learning framework that performs dual-context learning through both rhythm-level and heartbeat-level contrasting with soft targets. We evaluated our pretrained model on two downstream tasks: 1) multilabel classification for global rhythm assessment, and 2) ECG segmentation to assess its capacity to learn representations across both contexts. We conducted an ablation study and compared the best configuration with three other methods, including one ECG foundation model. Despite the foundation model’s broader pretraining, Beat-SSL reached 93% of its performance in the multilabel classification task and surpassed all other methods in the segmentation task by 4%.
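The difference between hard and soft contrastive targets can be made concrete with a generic loss. The sketch below is a plain soft-target cross-entropy over pairwise cosine similarities; it is an assumption about the general family of such losses, not Beat-SSL's actual objective, and the function name, temperature, and toy inputs are all hypothetical.

```python
import numpy as np

def soft_target_contrastive_loss(embeddings, soft_targets, temperature=0.1):
    """Cross-entropy between soft targets and the softmax over pairwise similarities."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)              # exclude self-pairs
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.array(soft_targets, dtype=float)
    np.fill_diagonal(targets, 0.0)
    targets /= targets.sum(axis=1, keepdims=True)  # each row becomes a distribution
    off_diagonal = ~np.eye(len(z), dtype=bool)     # skip the masked self-pairs
    return float(-(targets[off_diagonal] * log_probs[off_diagonal]).sum() / len(z))

# Three toy "heartbeat" embeddings: the first two are similar, the third differs.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
graded = np.array([[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.5, 0.5, 0.0]])
loss = soft_target_contrastive_loss(embeddings, graded)
```

A hard target would put all of a row's mass on a single positive. The soft version lets every other sample contribute in proportion to a graded similarity score, which is the property the abstract argues suits ECG signals.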
[343] Learning to Discover at Test Time
Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
Main category: cs.LG
TL;DR: TTT-Discover uses reinforcement learning at test time to discover state-of-the-art solutions for scientific problems by continuously training LLMs on specific test problems rather than using frozen models.
Details
Motivation: Prior test-time scaling methods use frozen LLMs, but the authors want to enable LLMs to continue training with experience specific to each test problem, focusing on producing one great solution for that particular problem rather than average performance across many problems.
Method: Test-Time Training to Discover (TTT-Discover) performs reinforcement learning at test time with learning objectives and search subroutines designed to prioritize the most promising solutions. It focuses on problems with continuous rewards and uses an open model (OpenAI gpt-oss-120b) with test-time training runs performed via Tinker API.
Result: TTT-Discover achieves state-of-the-art results across multiple domains: (i) Erdős’ minimum overlap problem and autocorrelation inequality, (ii) GPUMode kernel competition (up to 2× faster than prior art), (iii) past AtCoder algorithm competitions, and (iv) denoising problem in single-cell analysis. All results are reproducible with publicly available code.
Conclusion: Test-time training enables discovery of novel state-of-the-art solutions across diverse scientific domains using open models at low cost, outperforming previous methods that required closed frontier models.
Abstract: How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős’ minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
[344] Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing
Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang
Main category: cs.LG
TL;DR: Feature-space Smoothing (FS) with Purifier and Smoothness Mapper (PSM) provides certified robustness guarantees for multimodal large language models against adversarial attacks by maintaining feature cosine similarity bounds.
Details
Motivation: Multimodal large language models (MLLMs) are powerful but vulnerable to adversarial perturbations that distort feature representations and cause erroneous predictions. Current methods lack theoretical robustness guarantees.
Method: Proposes Feature-space Smoothing (FS) that transforms feature encoders into smoothed variants with certified cosine similarity bounds under ℓ₂-bounded attacks. Introduces Purifier and Smoothness Mapper (PSM) as a plug-and-play module to improve Gaussian robustness scores without retraining MLLMs.
Result: FS-PSM reduces Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1% across diverse MLLMs and downstream tasks, outperforming adversarial training while providing theoretical robustness guarantees.
Conclusion: The proposed FS-PSM framework offers both strong theoretical certified robustness and superior empirical performance against adversarial attacks on MLLMs, providing a practical defense solution without requiring model retraining.
Abstract: Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we show that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining of the MLLMs. We demonstrate that FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.
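The idea of turning an encoder into a smoothed variant can be illustrated with a Monte Carlo estimate. The sketch assumes the smoothed encoder is the average of unit-normalized features under Gaussian input noise, in the style of randomized smoothing; the paper's exact definition, its certified bound, and the PSM module are not reproduced, and the toy encoder, `sigma`, and sample counts are hypothetical.

```python
import numpy as np

def toy_encoder(x, W):
    """Hypothetical feature encoder: a fixed linear map followed by tanh."""
    return np.tanh(x @ W)

def smoothed_features(x, W, sigma, n_samples, rng):
    """Monte Carlo average of unit-normalized features under Gaussian input noise."""
    noise = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    feats = toy_encoder(x + noise, W)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12
    return feats.mean(axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
x_clean = rng.standard_normal(8)
delta = rng.standard_normal(8)
x_adv = x_clean + 0.1 * delta / np.linalg.norm(delta)   # l2 perturbation of norm 0.1
sim = cosine_similarity(
    smoothed_features(x_clean, W, sigma=0.5, n_samples=2000, rng=rng),
    smoothed_features(x_adv, W, sigma=0.5, n_samples=2000, rng=rng),
)
```

The quantity `sim` is the empirical counterpart of the feature cosine similarity that FS bounds from below. A certificate would replace this estimate with a guaranteed lower bound that depends on the perturbation budget and the noise level.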
[345] Counterfactual Training: Teaching Models Plausible and Actionable Explanations
Patrick Altmeyer, Aleksander Buszydlik, Arie van Deursen, Cynthia C. S. Liem
Main category: cs.LG
TL;DR: Counterfactual training uses counterfactual explanations during model training to improve explanatory capacity and adversarial robustness.
Details
Motivation: Current post-hoc counterfactual explanation methods focus on generating plausible and actionable explanations after training, but don't ensure models inherently produce good counterfactuals. The paper aims to make models directly accountable for producing desirable counterfactual explanations.
Method: Proposes counterfactual training that leverages counterfactual explanations during the training phase to minimize divergence between learned representations and plausible, actionable explanations. This integrates counterfactual desiderata directly into the training objective rather than as a post-hoc process.
Result: Empirical and theoretical demonstrations show the method trains models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
Conclusion: Counterfactual training represents a paradigm shift from post-hoc explanation methods to training regimes that directly optimize for explanatory capacity, resulting in models with better interpretability and robustness properties.
Abstract: We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
[346] Representation-Driven Reinforcement Learning
Ofir Nabati, Guy Tennenholtz, Shie Mannor
Main category: cs.LG
TL;DR: A representation-driven RL framework that treats policies as value estimates, using contextual bandit techniques to reframe exploration-exploitation as representation-exploitation for improved performance.
Details
Motivation: Traditional RL methods struggle with balancing exploration and exploitation. The paper aims to provide a new perspective by focusing on how policy representation can fundamentally determine optimal exploration-exploitation strategies.
Method: Represent policies as estimates of their expected values, embed policy networks into linear feature spaces, and leverage contextual bandit techniques to transform exploration-exploitation into representation-exploitation. Applied to both evolutionary and policy gradient-based approaches.
Result: Significantly improved performance compared to traditional RL methods, demonstrating the effectiveness of the representation-driven framework across different algorithmic approaches.
Conclusion: Policy representation is crucial for determining optimal exploration-exploitation strategies in RL. The framework offers a new perspective that emphasizes representation quality as key to balancing exploration and exploitation effectively.
Abstract: We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. Particularly, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework through its application to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.
[347] Scalable Multi-view Clustering via Explicit Kernel Features Maps
Chakib Fettal, Lazhar Labiod, Mohamed Nadif
Main category: cs.LG
TL;DR: Proposes efficient multi-view clustering using explicit kernel feature maps and non-iterative optimization for large attributed networks with millions of points.
Details
Motivation: High-dimensional data proliferation from social media, sensor networks, and online platforms creates challenges for clustering algorithms. Multi-view clustering integrates complementary information but existing methods struggle with scalability and efficiency on large attributed networks.
Method: Leverages explicit kernel feature maps and a non-iterative optimization strategy to enable efficient clustering.
Result: Enables efficient and accurate clustering on datasets with millions of points.
Conclusion: The approach addresses scalability limitations of existing multi-view clustering methods for large-scale attributed networks.
Abstract: The proliferation of high-dimensional data from sources such as social media, sensor networks, and online platforms has created new challenges for clustering algorithms. Multi-view clustering, which integrates complementary information from multiple data perspectives, has emerged as a powerful solution. However, existing methods often struggle with scalability and efficiency, particularly on large attributed networks. In this work, we address these limitations by leveraging explicit kernel feature maps and a non-iterative optimization strategy, enabling efficient and accurate clustering on datasets with millions of points.
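One well-known example of an explicit kernel feature map is random Fourier features for the Gaussian kernel, sketched below. Whether the paper uses this particular map is not stated in the abstract, so treat it as an assumption; the per-view concatenation, the names, and the parameter values are likewise illustrative, and the clustering step itself is omitted.

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    """Explicit map z(x) with z(x) @ z(y) approximating exp(-gamma * ||x - y||^2)."""
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, n_features))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + phase)

rng = np.random.default_rng(0)
view_a = rng.standard_normal((6, 3))   # hypothetical view 1 (e.g. node attributes)
view_b = rng.standard_normal((6, 5))   # hypothetical view 2
# Map each view separately, then concatenate into one embedding per point.
Z = np.hstack([
    random_fourier_features(view_a, 256, gamma=0.5, rng=rng),
    random_fourier_features(view_b, 256, gamma=0.5, rng=rng),
])
```

The point of an explicit map is cost: the embedding has a fixed width and is computed in time linear in the number of points, whereas a full kernel matrix grows quadratically and becomes impractical at millions of points.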
[348] Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
Steven Kolawole, Lucio Dery, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar
Main category: cs.LG
TL;DR: Bonsai is a gradient-free structured pruning method that eliminates backpropagation, reducing memory/compute costs while achieving state-of-the-art pruning performance and producing models twice as fast as semi-structured pruning.
Details
Motivation: Existing structured pruning methods rely on backward passes (gradients), which inflate memory requirements and compute costs, making it challenging to prune large models on memory-constrained hardware.
Method: Bonsai uses forward-pass-only perturbative pruning, eliminating the need for backpropagation. It enables efficient compression of large models on a broader range of hardware configurations through gradient-free structured pruning.
Result: Bonsai achieves state-of-the-art pruning performance while reducing memory/compute costs. It can prune 7B and 8B models to 50% sparsity on a single A6000 GPU (challenging for backprop-based methods requiring 2-3x memory). Produces models twice as fast as those from semi-structured pruning.
Conclusion: Removing backpropagation as a requirement enables pruning larger models on constrained hardware while achieving state-of-the-art efficiency and performance, making structured pruning more accessible and practical.
Abstract: Structured pruning is a promising approach to create smaller, faster large language models. However, existing methods typically rely on computing the gradient via backward passes, which can inflate memory requirements and compute costs. In this work we introduce Bonsai, a gradient-free structured pruning method that eliminates the need for backpropagation, significantly reducing memory requirements and compute costs while achieving state-of-the-art pruning performance. Bonsai uses forward-pass-only perturbative pruning to enable efficient compression of large models on a broader range of hardware configurations. Unlike existing structured pruning approaches, Bonsai not only achieves better compression with fewer resources but also produces models that are twice as fast as those generated by semi-structured pruning. As a concrete demonstration, we use Bonsai to prune 7B and 8B models to 50% sparsity on a single A6000 GPU – a task challenging for backprop-based methods in memory-constrained settings, as they require 2-3x the memory. Our results show that removing backprop as a requirement not only enables pruning larger models on constrained hardware but can also lead to state-of-the-art efficiency and performance.
[349] Neural Green’s Operators for Parametric Partial Differential Equations
Hugo Melchers, Joost Prins, Michael Abdelmalik
Main category: cs.LG
TL;DR: Neural Green’s Operators (NGOs) are parametric neural operators derived from Green’s function representations of linear PDEs, preserving linear action while approximating nonlinear coefficient dependence with neural networks.
Details
Motivation: To develop neural operators that preserve mathematical properties of Green’s functions while reducing learning complexity from full solution operators to just Green’s function dependence on PDE coefficients.
Method: Construct NGOs by preserving the linear action of Green’s operators on inhomogeneity fields while approximating the nonlinear dependence of Green’s functions on PDE coefficients using neural networks.
Result: NGOs achieve comparable or superior accuracy to existing operator networks with similar parameters, generalize better on out-of-distribution data, enable accurate auto-regressive time stepping, solve nonlinear PDEs via iterative solvers, and provide effective matrix preconditioners.
Conclusion: NGOs provide a mathematically grounded framework for neural operators that preserves Green’s function properties, reduces learning complexity, and enables various applications including time-dependent and nonlinear PDE solving with good generalization.
Abstract: This work introduces a paradigm for constructing parametric neural operators that are derived from finite-dimensional representations of Green’s operators for linear partial differential equations (PDEs). We refer to such neural operators as Neural Green’s Operators (NGOs). Our construction of NGOs preserves the linear action of Green’s operators on the inhomogeneity fields, while approximating the nonlinear dependence of the Green’s function on the coefficients of the PDE using neural networks. This construction reduces the complexity of the problem from learning the entire solution operator and its dependence on all parameters to only learning the Green’s function and its dependence on the PDE coefficients. Furthermore, we show that our explicit representation of Green’s functions enables the embedding of desirable mathematical attributes in our NGO architectures, such as symmetry, spectral, and conservation properties. Through numerical benchmarks on canonical PDEs, we demonstrate that NGOs achieve comparable or superior accuracy to Deep Operator Networks, Variationally Mimetic Operator Networks, and Fourier Neural Operators with similar parameter counts, while generalizing significantly better when tested on out-of-distribution data. For parametric time-dependent PDEs, we show that NGOs that are trained on a single time step can produce pointwise-accurate dynamics in an auto-regressive manner over arbitrarily large numbers of time steps. For parametric nonlinear PDEs, we demonstrate that NGOs trained exclusively on solutions of corresponding linear problems can be embedded within iterative solvers to yield accurate solutions, provided a suitable initial guess is available. Finally, we show that we can leverage the explicit representation of Green’s functions returned by NGOs to construct effective matrix preconditioners that accelerate iterative solvers for PDEs.
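To make the "linear action on the inhomogeneity field" concrete, here is a minimal sketch for a single fixed PDE, the 1-D Poisson problem -u'' = f on (0, 1) with u(0) = u(1) = 0, whose Green’s function is known in closed form. This is not the NGO architecture: an NGO would learn how the Green’s function depends on the PDE coefficients, whereas here it is hard-coded. The function names and the midpoint quadrature are assumptions made for illustration.

```python
def green(x, y):
    """Closed-form Green's function of -u'' = f on (0, 1) with zero Dirichlet data."""
    return x * (1.0 - y) if x <= y else y * (1.0 - x)


def apply_green(f, x, n=2000):
    """Approximate u(x) = integral over (0, 1) of G(x, y) f(y) dy with the midpoint rule."""
    h = 1.0 / n
    return h * sum(green(x, (j + 0.5) * h) * f((j + 0.5) * h) for j in range(n))


# For f = 1 the exact solution is u(x) = x (1 - x) / 2.
u_numeric = apply_green(lambda y: 1.0, 0.3)
u_exact = 0.3 * (1.0 - 0.3) / 2.0

# Linearity in f: applying G to 2*f1 + 3*f2 equals 2*G[f1] + 3*G[f2].
f1 = lambda y: y
f2 = lambda y: y * y
combined = apply_green(lambda y: 2.0 * f1(y) + 3.0 * f2(y), 0.3)
separate = 2.0 * apply_green(f1, 0.3) + 3.0 * apply_green(f2, 0.3)

print(u_numeric, u_exact, combined, separate)
```

The symmetry G(x, y) = G(y, x) visible in `green` is one example of the structural properties the abstract says can be embedded once the Green’s function is represented explicitly.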
[350] On the Exponential Convergence for Offline RLHF with Pairwise Comparisons
Zhirui Chen, Vincent Y. F. Tan
Main category: cs.LG
TL;DR: RL-LOW algorithm achieves exponential simple regret in offline RLHF with pairwise comparisons, matching instance-dependent lower bounds and extending to differential privacy.
Details
Motivation: Existing studies of offline RLHF with pairwise comparisons focus on worst-case inverse polynomial regret bounds (e.g., Õ(1/√n)), leaving a gap for instance-dependent exponential convergence analysis.
Method: Proposes the RL-LOW (RL with Locally Optimal Weights) algorithm for offline RLHF with pairwise comparisons, where the implicit reward is linear in an unknown parameter. The method yields exponential simple regret exp(-Ω(n/H)), where n is the number of data samples and H is an instance-dependent hardness quantity based on suboptimality gaps.
Result: RL-LOW achieves exponential simple regret exp(-Ω(n/H)), matching order-wise instance-dependent lower bound, demonstrating optimality. Extended to (ε,δ)-differential privacy with unchanged hardness parameter H asymptotically, showing privacy efficiency.
Conclusion: RL-LOW fills research gap by providing instance-dependent exponential convergence bounds for offline RLHF with pairwise comparisons, achieving order-wise optimality and maintaining efficiency under privacy constraints.
Abstract: We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective consists in ascertaining the optimal action for each state, with the ultimate goal of minimizing the simple regret. We propose an algorithm, RL with Locally Optimal Weights or RL-LOW, which yields an exponential form of simple regret of $\exp(-\Omega(n/H))$ where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound in offline RLHF with pairwise comparisons. Interestingly, we observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating order-wise optimality of our RL-LOW. In view of privacy considerations in practical applications, we also extend RL-LOW to the setting of $(\varepsilon,\delta)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of RL-LOW in terms of preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds of exponential convergence, our research fills the research gap in existing studies that concentrate on establishing worst-case regrets of inverse polynomial convergence (e.g., $\widetilde{O}(\frac{1}{\sqrt{n}})$) for offline RLHF with pairwise comparisons.
[351] ViSymRe: Vision-guided Multimodal Symbolic Regression
Da Li, Junping Yin, Jin Xu, Xinxin Li, Juan Zhang
Main category: cs.LG
TL;DR: ViSymRe is a vision-guided multimodal symbolic regression model that uses expression graphs as a third modality to bridge the gap between datasets and mathematical expressions, achieving better performance than dataset-only baselines.
Details
Motivation: Traditional symbolic regression models face efficiency and overfitting challenges. Recent LLM-based approaches struggle with modality gaps between input datasets and output mathematical expressions. There's a need for better multimodal approaches that can effectively bridge this gap without requiring expression graphs during inference.
Method: ViSymRe introduces a vision-guided multimodal approach that incorporates expression graphs as a third modality to bridge the dataset-expression gap. It learns to extract “virtual vision” from datasets without relying on globally available expression graphs during inference, addressing the core challenge of visual symbolic regression.
Result: ViSymRe achieves more competitive performance than state-of-the-art dataset-only baselines on multiple mainstream benchmarks. The predicted expressions not only fit datasets well but are also simple and structurally accurate.
Conclusion: ViSymRe successfully addresses the modality gap in symbolic regression by incorporating expression graphs as a bridging modality while maintaining practical applicability by not requiring expression graphs during inference, representing an effective approach for extracting simple mathematical expressions from observational data.
Abstract: Extracting a simple mathematical expression from an observational dataset to describe complex natural phenomena is one of the core objectives of artificial intelligence (AI). This field is known as symbolic regression (SR). Traditional SR models are based on genetic programming (GP) or reinforcement learning (RL), facing well-known challenges, such as low efficiency and overfitting. Recent studies have integrated SR with large language models (LLMs), enabling fast zero-shot inference by learning mappings from millions of dataset-expression pairs. However, since the input and output are inherently different modalities, such models often struggle to converge effectively. In this paper, we introduce ViSymRe, a vision-guided multimodal SR model that incorporates a third resource, the expression graph, to bridge the modality gap. Different from traditional multimodal models, ViSymRe is trained to extract vision, termed virtual vision, from datasets, without relying on the global availability of expression graphs, which addresses the essential challenge of visual SR, i.e., expression graphs are not available during inference. Evaluation results on multiple mainstream benchmarks show that ViSymRe achieves more competitive performance than the state-of-the-art dataset-only baselines. The expressions predicted by ViSymRe not only fit the dataset well but are also simple and structurally accurate, goals that SR models strive to achieve.
[352] Data-driven tool wear prediction in milling, based on a process-integrated single-sensor approach
Eric Hirsch, Christian Friedrich
Main category: cs.LG
TL;DR: Deep learning models for tool wear prediction using minimal training data and single-sensor setup, achieving 99.1% accuracy with ConvNeXt model.
Details
Motivation: Traditional tool wear prediction methods require multi-sensor setups and extensive data, limiting generalization to new industrial settings. Need for low-cost, adaptable solutions with minimal training data.
Method: Evaluated multiple ML models (CNN, LSTM, SVM, decision trees) trained on different input formats (feature vectors, STFT). Used single acceleration sensor setup and transfer learning across two processes. Tested with significantly reduced datasets.
Result: ConvNeXt model achieved exceptional 99.1% accuracy identifying tool wear using data from only four milling tools. Demonstrated effective transferability and performance under constrained data conditions.
Conclusion: Specific models like ConvNeXt can enable effective tool wear prediction with minimal data and simple sensor setups, supporting adaptable predictive maintenance strategies in industrial machining.
Abstract: Accurate tool wear prediction is essential for maintaining productivity and minimizing costs in machining. However, the complex nature of the tool wear process poses significant challenges to achieving reliable predictions. This study explores data-driven methods, in particular deep learning, for tool wear prediction. Traditional data-driven approaches often focus on a single process, relying on multi-sensor setups and extensive data generation, which limits generalization to new settings. Moreover, multi-sensor integration is often impractical in industrial environments. To address these limitations, this research investigates the transferability of predictive models using minimal training data, validated across two processes. Furthermore, it uses a simple setup with a single acceleration sensor to establish a low-cost data generation approach that facilitates the generalization of models to other processes via transfer learning. The study evaluates several machine learning models, including transformer-inspired convolutional neural networks (CNN), long short-term memory networks (LSTM), support vector machines (SVM), and decision trees, trained on different input formats such as feature vectors and short-time Fourier transform (STFT). The performance of the models is evaluated on two machines and on different amounts of training data, including scenarios with significantly reduced datasets, providing insight into their effectiveness under constrained data conditions. The results demonstrate the potential of specific models and configurations for effective tool wear prediction, contributing to the development of more adaptable and efficient predictive maintenance strategies in machining. Notably, the ConvNeXt model has an exceptional performance, achieving 99.1% accuracy in identifying tool wear using data from only four milling tools operated until they are worn.
[353] Explaining k-Nearest Neighbors: Abductive and Counterfactual Explanations
Pablo Barceló, Alexander Kozachinskiy, Miguel Romero Orth, Bernardo Subercaseaux, José Verschae
Main category: cs.LG
TL;DR: The paper analyzes the theoretical complexity of computing feature-based explanations (abductive and counterfactual) for k-NN classifiers, showing both positive and negative complexity results across different feature spaces and distance functions.
Details
Motivation: k-NN classifiers are widely used but their explainability properties are poorly understood theoretically. While data-based explanations (identifying nearest neighbors) are interpretable, they become impractical in high-dimensional settings where feature importance is unclear. The paper aims to understand k-NN through feature-based explanations instead.
Method: The paper studies two types of feature-based explanations: 1) abductive explanations (minimum sufficient reasons - sets of features sufficient to guarantee classification), and 2) counterfactual explanations (minimum distance feature changes needed to change classification). The analysis considers discrete vs continuous feature spaces, different distance functions, and uses computational complexity theory.
Result: Presents a detailed landscape of positive and negative complexity results for computing both types of explanations. Despite some negative complexity results (NP-hardness), shows that Integer Quadratic Programming and SAT solving can compute explanations in practice.
Conclusion: Feature-based explanations for k-NN classifiers have complex computational properties that depend on feature space type and distance function. While theoretically challenging in some cases, practical computation is possible using optimization and constraint solving techniques.
Abstract: Despite the wide use of $k$-Nearest Neighbors as classification models, their explainability properties remain poorly understood from a theoretical perspective. While nearest neighbors classifiers offer interpretability from a “data perspective”, in which the classification of an input vector $\bar{x}$ is explained by identifying the vectors $\bar{v}_1, \ldots, \bar{v}_k$ in the training set that determine the classification of $\bar{x}$, we argue that such explanations can be impractical in high-dimensional applications, where each vector has hundreds or thousands of features and it is not clear what their relative importance is. Hence, we focus on understanding nearest neighbor classifications through a “feature perspective”, in which the goal is to identify how the values of the features in $\bar{x}$ affect its classification. Concretely, we study abductive explanations such as “minimum sufficient reasons”, which correspond to sets of features in $\bar{x}$ that are enough to guarantee its classification, and counterfactual explanations based on the minimum distance feature changes one would have to perform in $\bar{x}$ to change its classification. We present a detailed landscape of positive and negative complexity results for counterfactual and abductive explanations, distinguishing between discrete and continuous feature spaces, and considering the impact of the choice of distance function involved. Finally, we show that despite some negative complexity results, Integer Quadratic Programming and SAT solving allow for computing explanations in practice.
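The two explanation types can be made concrete with a brute-force sketch on a toy problem. It assumes binary features, Hamming distance, k = 1, and ties resolved in favor of class 0; these choices, the function names, and the tiny training set are all illustrative assumptions, not the paper's setup. The search is exponential in the number of features, which is consistent with the hardness results the abstract mentions and is why the paper turns to solvers in practice.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def classify(z, positives, negatives):
    """1-NN label under Hamming distance; ties go to class 0 (assumed convention)."""
    d_pos = min(hamming(z, p) for p in positives)
    d_neg = min(hamming(z, q) for q in negatives)
    return 1 if d_pos < d_neg else 0


def is_sufficient(x, fixed, positives, negatives):
    """True if every vector agreeing with x on `fixed` gets the same label as x."""
    target = classify(x, positives, negatives)
    free = [i for i in range(len(x)) if i not in fixed]
    for values in product((0, 1), repeat=len(free)):
        z = list(x)
        for i, v in zip(free, values):
            z[i] = v
        if classify(z, positives, negatives) != target:
            return False
    return True


def minimum_sufficient_reason(x, positives, negatives):
    """Smallest feature set that guarantees the label of x (exhaustive search)."""
    for size in range(len(x) + 1):
        for fixed in combinations(range(len(x)), size):
            if is_sufficient(x, fixed, positives, negatives):
                return fixed


def counterfactual_distance(x, positives, negatives):
    """Fewest feature flips that change the label; assumes both labels are reachable."""
    target = classify(x, positives, negatives)
    return min(
        hamming(x, z)
        for z in product((0, 1), repeat=len(x))
        if classify(z, positives, negatives) != target
    )


positives = [(1, 1, 0), (1, 1, 1)]
negatives = [(0, 0, 0), (0, 0, 1)]
x = (1, 1, 0)

reason = minimum_sufficient_reason(x, positives, negatives)
flips = counterfactual_distance(x, positives, negatives)
print(reason, flips)
```

In this toy instance the third feature never matters, so fixing the first two features is enough to guarantee the label, while flipping a single one of them produces a tie and therefore a different label under the assumed tie rule.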
[354] Sparse Data Diffusion for Scientific Simulations in Biology and Physics
Phil Ostheimer, Mayank Nagda, Andriy Balinskyy, Jean Radig, Carl Herrmann, Stephan Mandt, Marius Kloft, Sophie Fellenz
Main category: cs.LG
TL;DR: SDD is a diffusion model that explicitly models exact zeros in sparse scientific data via Sparsity Bits, achieving higher fidelity than baselines in particle physics and single-cell biology applications.
Details
Motivation: Existing diffusion models lack physical rigor to faithfully represent sparse data where exact zeros encode physical absence rather than weak signal, which is fundamental to scientific simulations in biology and physics.
Method: Introduces Sparse Data Diffusion (SDD) that explicitly models exact zeros via Sparsity Bits, unifying efficient ML generation with physically grounded sparsity handling.
Result: Empirical validation in particle physics and single-cell biology demonstrates that SDD achieves higher fidelity than baseline methods in capturing sparse patterns critical for scientific analysis.
Conclusion: SDD advances scalable and physically faithful simulation by properly handling sparsity in scientific data, enabling more accurate generative modeling for scientific applications.
Abstract: Sparse data is fundamental to scientific simulations in biology and physics, from single-cell gene expression to particle calorimetry, where exact zeros encode physical absence rather than weak signal. However, existing diffusion models lack the physical rigor to faithfully represent this sparsity. This work introduces Sparse Data Diffusion (SDD), a generative method that explicitly models exact zeros via Sparsity Bits, unifying efficient ML generation with physically grounded sparsity handling. Empirical validation in particle physics and single-cell biology demonstrates that SDD achieves higher fidelity than baseline methods in capturing sparse patterns critical for scientific analysis, advancing scalable and physically faithful simulation.
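The abstract does not spell out how Sparsity Bits are defined, so the following is only a guess at the general idea: carry a separate indicator channel alongside the values and use it to emit exact zeros, since a continuous generator on its own tends to output small nonzero numbers instead. The encoding convention (bit 1 means "present"), the threshold, and the hand-written noise that stands in for a generative model are all assumptions.

```python
def encode(values):
    """Split a sparse vector into an indicator channel and a dense value channel."""
    bits = [0.0 if v == 0.0 else 1.0 for v in values]
    return bits, list(values)


def decode(bits, dense, threshold=0.5):
    """Emit an exact 0.0 wherever the (possibly noisy) indicator falls below threshold."""
    return [d if b >= threshold else 0.0 for b, d in zip(bits, dense)]


counts = [0.0, 3.2, 0.0, 0.0, 1.7]
bits, dense = encode(counts)

# Stand-in for an imperfect generative model: small perturbations on both channels.
noisy_bits = [b + e for b, e in zip(bits, [0.1, -0.2, 0.05, -0.1, 0.15])]
noisy_dense = [d + e for d, e in zip(dense, [0.03, -0.01, 0.02, -0.04, 0.01])]

restored = decode(noisy_bits, noisy_dense)
print(noisy_dense)
print(restored)
```

The first printed list shows the failure mode (entries that should be zero are merely small), and the second shows exact zeros restored by thresholding the indicator channel.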
[355] ImputeGAP: A Comprehensive Library for Time Series Imputation
Quentin Nater, Mourad Khayati
Main category: cs.LG
TL;DR: ImputeGAP is a comprehensive Python library for time series imputation that supports diverse methods, realistic missingness simulation, and downstream evaluation.
Details
Motivation: Sensor failures create missing values in time series data, but existing libraries have limited imputation support, lack realistic missingness simulation, and don't account for downstream analysis impact.
Method: Developed ImputeGAP library with modular missing data simulation, diverse imputation algorithms, automated hyperparameter tuning, benchmarking, explainability features, and compatibility with popular time series frameworks.
Result: Created a comprehensive imputation library that addresses limitations of existing tools by supporting various imputation methods, realistic missingness patterns, and downstream evaluation capabilities.
Conclusion: ImputeGAP provides a robust solution for time series imputation that bridges the gap between imputation methods and practical data analysis needs, offering extensive customization and evaluation features.
Abstract: With the prevalence of sensor failures, imputation, the process of estimating missing values, has emerged as the cornerstone of time series data pre-processing. While numerous imputation algorithms have been developed to repair these data gaps, existing time series libraries provide limited imputation support. Furthermore, they often lack the ability to simulate realistic time series missingness patterns and fail to account for the impact of the imputed data on subsequent downstream analysis. This paper introduces ImputeGAP, a comprehensive library for time series imputation that supports a diverse range of imputation methods and modular missing data simulation, catering to datasets with varying characteristics. The library includes extensive customization options, such as automated hyperparameter tuning, benchmarking, explainability, downstream evaluation, and compatibility with popular time series frameworks.
[356] On shallow feedforward neural networks with inputs from a topological space
Vugar Ismailov
Main category: cs.LG
TL;DR: Shallow feedforward neural networks with topological inputs (TFNNs) can approximate any continuous function on topological spaces, extending to an approximative version of Kolmogorov’s superposition theorem for compact metric spaces.
Details
Motivation: To extend neural network theory beyond Euclidean spaces by studying feedforward networks that can handle inputs from arbitrary topological spaces, and to establish their universal approximation capabilities in this broader context.
Method: Study feedforward neural networks with inputs from topological spaces (TFNNs), prove a universal approximation theorem for shallow TFNNs, and apply the results to obtain an approximative version of Kolmogorov’s superposition theorem for compact metric spaces.
Result: Proved that shallow TFNNs have the capacity to approximate any continuous function defined on topological spaces, and derived an approximative version of Kolmogorov’s superposition theorem as an application for compact metric spaces.
Conclusion: Feedforward neural networks can be extended to handle inputs from topological spaces while maintaining universal approximation properties, providing theoretical foundations for neural networks operating on non-Euclidean data structures.
Abstract: We study feedforward neural networks with inputs from a topological space (TFNNs). We prove a universal approximation theorem for shallow TFNNs, which demonstrates their capacity to approximate any continuous function defined on this topological space. As an application, we obtain an approximative version of Kolmogorov’s superposition theorem for compact metric spaces.
[357] Adaptively Point-weighting Curriculum Learning
Wensheng Li, Yichao Tian, Hao Wang, Ruifeng Zhou, Hanting Guan, Chao Zhang, Dacheng Tao
Main category: cs.LG
TL;DR: APW curriculum learning adaptively weights samples based on training loss, following easy-to-hard paradigm guided by current training state, with theoretical analysis and experimental validation.
Details
Motivation: Existing automatic curriculum learning methods keep a preference for easy samples throughout training regardless of the evolving training state. This resembles a human curriculum that fails to provide individualized instruction, which can delay learning progress.
Method: The adaptively point-weighting (APW) curriculum learning method assigns a weight to each training sample based on its training loss, following an easy-to-hard training paradigm guided by the current training state of the network.
Result: Theoretical analysis covers training effectiveness, stability, and generalization performance; experimental results validate theoretical findings and demonstrate superiority of APW method.
Conclusion: APW addresses limitations of existing CL methods by providing adaptive sample weighting based on current training state, improving learning progress through individualized instruction-like approach.
Abstract: Curriculum learning (CL) mimics human learning, in which easy samples are learned first, followed by harder samples, and has become an effective method for training deep networks. However, many existing automatic CL methods maintain a preference for easy samples during the entire training process regardless of the constantly evolving training state. This is just like a human curriculum that fails to provide individualized instruction, which can delay learning progress. To address this issue, we propose an adaptively point-weighting (APW) curriculum learning method that assigns a weight to each training sample based on its training loss. The weighting strategy of APW follows the easy-to-hard training paradigm, guided by the current training state of the network. We present a theoretical analysis of APW, including training effectiveness, training stability, and generalization performance. Experimental results validate these theoretical findings and demonstrate the superiority of the proposed APW method.
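The abstract does not give the APW weighting rule, so the sketch below shows only the generic idea of loss-based sample weighting with an easy-to-hard schedule. The softmax-style formula, the `temperature` parameter, and the function name are assumptions chosen for illustration and should not be read as the APW method.

```python
import math


def loss_based_weights(losses, temperature):
    """Normalized weights that favor low-loss samples; a higher temperature flattens them."""
    scores = [math.exp(-loss / temperature) for loss in losses]
    total = sum(scores)
    return [s / total for s in scores]


# Per-sample training losses: one easy, one medium, one hard sample.
losses = [0.1, 1.0, 3.0]

# Early in training: low temperature, so the easy sample dominates.
early = loss_based_weights(losses, temperature=0.5)

# Later in training: higher temperature, so hard samples regain weight.
late = loss_based_weights(losses, temperature=5.0)

print(early)
print(late)
```

In an adaptive scheme the schedule would be driven by the current training state (for example, the loss distribution itself) instead of a fixed temperature, which is the aspect the paper emphasizes.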
[358] PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models
Alejandro Velez-Arce, Jesus Caraballo, Marinka Zitnik
Main category: cs.LG
TL;DR: PyTDC is an open-source ML platform for multimodal biological AI models that unifies data sources, model weights, and standardizes benchmarking, with a case study showing current methods struggle with single-cell drug-target nomination tasks.
Details
Motivation: Existing biomedical benchmarks lack end-to-end infrastructure for training, evaluation, and inference of models that integrate multimodal biological data across various ML tasks in therapeutics.
Method: Developed PyTDC platform with streamlined training, evaluation, and inference software that unifies distributed, heterogeneous, continuously updated data sources and model weights, and standardizes benchmarking endpoints.
Result: State-of-the-art graph representation learning and domain-specific graph theory methods perform poorly on single-cell drug-target nomination tasks. A context-aware geometric deep learning method outperforms baselines but fails to generalize to unseen cell types or incorporate additional modalities.
Conclusion: PyTDC demonstrates capacity to facilitate research on multimodal, context-aware foundation models for biomedical AI, highlighting current limitations in generalizability and multimodal integration that need to be addressed.
Abstract: Existing biomedical benchmarks do not provide end-to-end infrastructure for training, evaluation, and inference of models that integrate multimodal biological data and a broad range of machine learning tasks in therapeutics. We present PyTDC, an open-source machine-learning platform providing streamlined training, evaluation, and inference software for multimodal biological AI models. PyTDC unifies distributed, heterogeneous, continuously updated data sources and model weights and standardizes benchmarking and inference endpoints. This paper discusses the components of PyTDC’s architecture and, to our knowledge, the first-of-its-kind case study on the introduced single-cell drug-target nomination ML task. We find state-of-the-art methods in graph representation learning and domain-specific methods from graph theory perform poorly on this task. Though we find a context-aware geometric deep learning method that outperforms the evaluated SoTA and domain-specific baseline methods, the model is unable to generalize to unseen cell types or incorporate additional modalities, highlighting PyTDC’s capacity to facilitate an exciting avenue of research developing multimodal, context-aware, foundation models for open problems in biomedical AI.
[359] GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning
Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi
Main category: cs.LG
TL;DR: GenPO integrates diffusion policies into on-policy RL (like PPO) using exact diffusion inversion for invertible action mappings, enabling log-likelihood computation and entropy regularization in large-scale parallel simulators.
Details
Motivation: Diffusion policies show strong exploration and multimodality but haven't been integrated into on-policy RL frameworks like PPO, which are widely used with large-scale parallel GPU simulators (IsaacLab). The key challenge is computing state-action log-likelihoods for diffusion policies, which is intractable due to irreversible forward-reverse processes.
Method: Proposes GenPO framework that uses exact diffusion inversion to construct invertible action mappings. Introduces a novel doubled dummy action mechanism enabling invertibility via alternating updates. Uses action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates.
Result: Extensive experiments on eight IsaacLab benchmarks (legged locomotion, dexterous manipulation, aerial control, robotic arm tasks) demonstrate GenPO’s superiority over existing RL baselines. GenPO is the first method to successfully integrate diffusion policies into on-policy RL.
Conclusion: GenPO bridges the gap between diffusion policies and on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment by solving the log-likelihood computation barrier through invertible action mappings.
Abstract: Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO’s superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
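Why alternating updates over a doubled variable can give exact invertibility is easiest to see in a scalar toy case. The sketch below uses an additive coupling step, a standard construction from invertible networks, as an analogy. It is not GenPO's update rule: the pair `(a, b)`, the `shift` function, and its 0.5 scale are hypothetical.

```python
import math


def shift(v):
    """Arbitrary nonlinearity; it never needs to be inverted itself."""
    return 0.5 * math.tanh(v)


def forward(a, b):
    """Update each half using only the other half, one after the other."""
    a_new = a + shift(b)
    b_new = b + shift(a_new)
    return a_new, b_new


def inverse(a_new, b_new):
    """Undo the two updates in reverse order; exact up to floating-point rounding."""
    b = b_new - shift(a_new)
    a = a_new - shift(b)
    return a, b


a0, b0 = 0.7, -1.3
a1, b1 = forward(a0, b0)
a_back, b_back = inverse(a1, b1)
print((a0, b0), (a1, b1), (a_back, b_back))
```

Because each update only adds a function of the other half, the Jacobian of the full step has unit determinant, so the change-of-variables term in the log-likelihood is zero for this toy map. An exactly invertible map with a tractable Jacobian is what makes a log-likelihood computable in the first place.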
[360] KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction
Han Liu, Keyan Ding, Peilin Chen, Yinwei Wei, Liqiang Nie, Dapeng Wu, Shiqi Wang
Main category: cs.LG
TL;DR: KEPLA is a deep learning framework that integrates Gene Ontology and ligand property knowledge to improve protein-ligand binding affinity prediction beyond just structural features.
Details
Motivation: Current deep learning approaches for protein-ligand binding affinity prediction rely mainly on structural features, overlooking valuable biochemical knowledge that could enhance prediction accuracy and provide better insights.
Method: KEPLA takes protein sequences and ligand molecular graphs as input and optimizes two objectives: 1) aligning global representations with knowledge graph relations (Gene Ontology and ligand properties) to capture biochemical insights, and 2) using cross attention between local representations to build fine-grained joint embeddings for prediction.
Result: Experiments on two benchmark datasets show KEPLA consistently outperforms state-of-the-art baselines in both in-domain and cross-domain scenarios. Interpretability analyses based on knowledge graph relations and cross attention maps provide valuable insights into predictive mechanisms.
Conclusion: Integrating prior biochemical knowledge with structural features significantly improves protein-ligand binding affinity prediction, and KEPLA’s interpretability features offer valuable insights for drug discovery applications.
Abstract: Accurate prediction of protein-ligand binding affinity is critical for drug discovery. While recent deep learning approaches have demonstrated promising results, they often rely solely on structural features of proteins and ligands, overlooking their valuable biochemical knowledge associated with binding affinity. To address this limitation, we propose KEPLA, a novel deep learning framework that explicitly integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. KEPLA takes protein sequences and ligand molecular graphs as input and optimizes two complementary objectives: (1) aligning global representations with knowledge graph relations to capture domain-specific biochemical insights, and (2) leveraging cross attention between local representations to construct fine-grained joint embeddings for prediction. Experiments on two benchmark datasets across both in-domain and cross-domain scenarios demonstrate that KEPLA consistently outperforms state-of-the-art baselines. Furthermore, interpretability analyses based on knowledge graph relations and cross attention maps provide valuable insights into the underlying predictive mechanisms.
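The second objective relies on cross attention between local protein and ligand representations. A minimal sketch of standard scaled dot-product cross attention follows, assuming protein tokens act as queries and ligand atoms as keys and values. The shapes, the random embeddings, and the absence of learned projection matrices are simplifications for illustration, not details from the paper.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
protein_tokens = rng.normal(size=(5, 16))  # hypothetical: 5 residues, 16-dim embeddings
ligand_atoms = rng.normal(size=(3, 16))    # hypothetical: 3 atoms, 16-dim embeddings
joint, attn = cross_attention(protein_tokens, ligand_atoms, ligand_atoms)
```

Each row of `attn` shows how strongly one protein token attends to each ligand atom, which is the kind of map the interpretability analysis inspects.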
[361] Training-Free Geospatial Place Representation Learning from Large-Scale Point-of-Interest Graph Data
Mohammad Hashemi, Hossein Amiri, Andreas Zufle
Main category: cs.LG
TL;DR: PlaceRep is a training-free geospatial representation learning method that clusters POIs into semantically meaningful places across spatial scales, outperforming most state-of-the-art graph-based methods while generating region-level representations up to 100x faster.
Details
Motivation: Existing geospatial representation learning methods aggregate POIs into fixed administrative boundaries, but POIs form semantically meaningful groups that extend across these boundaries, better reflecting human activity and urban function.
Method: PlaceRep constructs place-level representations by clustering spatially and semantically related POIs from large-scale POI graphs (using U.S. Foursquare data), producing general-purpose urban region embeddings without model pre-training.
Result: PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods on population density estimation and housing price prediction tasks, with up to 100x speedup in generating region-level representations.
Conclusion: PlaceRep provides a scalable, efficient, training-free solution for multi-granular geospatial analysis by automatically identifying semantically meaningful places across spatial scales, offering better performance and computational efficiency.
Abstract: Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a training-free geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments with population density estimation and housing price prediction as downstream tasks show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.
[362] Can Language Models Discover Scaling Laws?
Haowei Lin, Haotian Ye, Wenzheng Feng, Quzhe Huang, Yujun Li, Hubert Lim, Zhengrui Li, Xiangyu Wang, Jianzhu Ma, Yitao Liang, James Zou
Main category: cs.LG
TL;DR: SLDAgent is an evolution-based AI agent that can automatically discover scaling laws for predicting model performance, outperforming human-derived formulas across 8 diverse tasks.
Details
Motivation: Discovering scaling laws for predicting model performance at scale is currently a slow, case-specific human experimentation process. The paper investigates whether LLMs can automate this process to overcome the limitations of manual scaling law discovery.
Method: The authors collected over 5,000 experiments from existing literature and curated 8 diverse scaling law discovery tasks. They introduced SLDAgent, an evolution-based agent that co-optimizes both the scaling law model structure and its parameters, enabling autonomous exploration of complex variable relationships.
Result: SLDAgent discovered scaling laws that consistently exhibit more accurate extrapolation than established human-derived counterparts across all 8 tasks. The discovered laws were verified to be practically useful in both pretraining and finetuning applications.
Conclusion: This work establishes a new paradigm for agentic scientific discovery, demonstrating that AI systems can understand their own scaling behavior and contribute novel, practical knowledge back to the research community.
Abstract: Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case-specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate eight diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimizes the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrate that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.
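To make "fitting a law and extrapolating" concrete, here is a minimal sketch of the parameter-fitting half of the problem only. It assumes one fixed functional form, a pure power law `loss = a * n**(-b)`, and synthetic noiseless data; the agent described above also searches over the form itself, which this sketch does not attempt. The function names and the numbers are illustrative.

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss = a * n**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return np.exp(intercept), -slope

def predict(n, a, b):
    return a * np.asarray(n, dtype=float) ** (-b)

# Synthetic small-scale runs generated from a known law (a=5.0, b=0.3).
n_small = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss_small = 5.0 * n_small ** (-0.3)

a, b = fit_power_law(n_small, loss_small)
# Extrapolate two orders of magnitude beyond the largest fitted scale.
extrapolated = predict(1e10, a, b)
```

Extrapolation accuracy of this kind, measured on held-out larger scales, is the criterion on which the discovered laws are compared with human-derived ones.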
[363] Stability, Complexity and Data-Dependent Worst-Case Generalization Bounds
Mario Tuci, Lennart Bastian, Benjamin Dupuis, Nassir Navab, Tolga Birdal, Umut Şimşekli
Main category: cs.LG
TL;DR: The paper introduces random set stability to bound generalization error for stochastic optimization algorithms using empirically relevant complexity measures, avoiding intractable mutual information terms.
Details
Motivation: Existing generalization bounds for stochastic optimization algorithms either involve intractable mutual information terms or impractical combinatorial geometric quantities, limiting understanding and practical application.
Method: Introduces random set stability framework tailored for data-dependent random sets from stochastic optimization. Combines stability parameter with empirically relevant, data- and algorithm-dependent complexity measures of random sets.
Result: Shows worst-case generalization error can be bounded by random set stability parameter and empirically relevant complexity measures. Framework improves existing topological bounds by recovering previous complexity notions without mutual information terms.
Conclusion: The random set stability framework addresses limitations of existing approaches by combining empirically relevant complexity measures with tractable stability analysis, validated through experiments in practical settings.
Abstract: Providing generalization guarantees for stochastic optimization algorithms remains a key challenge in learning theory. Recently, numerous works demonstrated the impact of the geometric properties of optimization trajectories on generalization performance. These works propose worst-case generalization bounds in terms of various notions of intrinsic dimension and/or topological complexity, which were found to empirically correlate with the generalization error. However, most of these approaches involve intractable mutual information terms, which limit a full understanding of the bounds. In contrast, some authors built on algorithmic stability to obtain worst-case bounds involving geometric quantities of a combinatorial nature, which are impractical to compute. In this paper, we address these limitations by combining empirically relevant complexity measures with a framework that avoids intractable quantities. To this end, we introduce the concept of random set stability, tailored for the data-dependent random sets produced by stochastic optimization algorithms. Within this framework, we show that the worst-case generalization error can be bounded in terms of (i) the random set stability parameter and (ii) empirically relevant, data- and algorithm-dependent complexity measures of the random set. Moreover, our framework improves existing topological generalization bounds by recovering previous complexity notions without relying on mutual information terms. Through a series of experiments in practically relevant settings, we validate our theory by evaluating the tightness of our bounds and the interplay between topological complexity and stability.
[364] Toward Robust Semi-supervised Regression via Dual-stream Knowledge Distillation
Ye Su, Hezhe Qiao, Wei Huang, Lin Chen
Main category: cs.LG
TL;DR: Dual-stream Knowledge Distillation (DKD) framework for semi-supervised regression that distills both continuous-valued knowledge and distribution information to improve sample efficiency and handle noisy pseudo-labels.
Details
Motivation: Existing semi-supervised regression methods fail to fully exploit unlabeled data and are sensitive to pseudo-label quality. Current approaches either use constraint-based regularization or consistency-driven pseudo-labeling, but these have limitations in handling noisy predictions and preserving regression magnitude information.
Method: Proposes DKD with teacher-student architecture: teacher optimized solely with ground-truth labels for label distribution estimation; student learns from mixture of real labels and teacher-generated pseudo targets on unlabeled data. Introduces Decoupled Distribution Alignment (DDA) to align target and non-target class distributions between teacher and student.
Result: The framework enables effective supervision transfer, allowing student to leverage pseudo labels more robustly. DDA enhances student’s capacity to mitigate noise in pseudo-label supervision and learn better calibrated regression predictors.
Conclusion: DKD addresses limitations of existing SSR methods by better exploiting unlabeled data through dual-stream knowledge distillation and distribution alignment, improving robustness to noisy pseudo-labels and preserving regression magnitude information.
Abstract: Semi-supervised regression (SSR), which aims to predict continuous scores of samples while reducing reliance on a large amount of labeled data, has recently received considerable attention across various applications, including computer vision, natural language processing, and audio and medical analysis. Existing SSR methods typically train models on scarce labeled data by introducing constraint-based regularization or ordinal ranking to reduce overfitting. However, these approaches fail to fully exploit the abundance of unlabeled samples. While consistency-driven pseudo-labeling methods attempt to incorporate unlabeled data, they are highly sensitive to pseudo-label quality and noisy predictions. To address these challenges, we introduce a Dual-stream Knowledge Distillation framework (DKD), which is specially designed for the SSR task to distill both continuous-valued knowledge and distribution information, better preserving regression magnitude information and improving sample efficiency. Specifically, in DKD, the teacher is optimized solely with ground-truth labels for label distribution estimation, while the student learns from a mixture of real labels and teacher-generated pseudo targets on unlabeled data. The distillation design ensures effective supervision transfer, allowing the student to leverage pseudo labels more robustly. Then, we introduce an advanced Decoupled Distribution Alignment (DDA) to align the target class and non-target class between teacher and student on the distribution, enhancing the student’s capacity to mitigate noise in pseudo-label supervision and learn a better-calibrated regression predictor.
[365] StoxLSTM: A Stochastic Extended Long Short-Term Memory Network for Time Series Forecasting
Zihao Wang, Yunjie Li, Lingmin Zan, Zheng Gong, Mengtao Zhu
Main category: cs.LG
TL;DR: StoxLSTM enhances xLSTM with stochastic latent variables in a state space framework to better model uncertainty and complex dynamics in time series.
Details
Motivation: Standard xLSTM's deterministic architecture limits its ability to handle real-world time series with inherent uncertainty, stochasticity, and complex hierarchical latent dynamics.
Method: Integrates latent stochastic variables directly into xLSTM recurrent units within a state space modeling framework using an efficient non-autoregressive generative approach.
Result: StoxLSTM consistently outperforms state-of-the-art baselines on benchmark datasets, achieving superior performance and generalization.
Conclusion: The stochastic extension of xLSTM effectively models uncertainty and complex temporal dynamics while maintaining architectural simplicity.
Abstract: The Extended Long Short-Term Memory (xLSTM) network has demonstrated strong capability in modeling complex long-term dependencies in time series data. Despite its success, the deterministic architecture of xLSTM limits its representational capacity and forecasting performance, especially on challenging real-world time series datasets characterized by inherent uncertainty, stochasticity, and complex hierarchical latent dynamics. In this work, we propose StoxLSTM, a stochastic xLSTM within a designed state space modeling framework, which integrates latent stochastic variables directly into the recurrent units to effectively model deep latent temporal dynamics and uncertainty. The designed state space model follows an efficient non-autoregressive generative approach, achieving strong predictive performance without complex modifications to the original xLSTM architecture. Extensive experiments on publicly available benchmark datasets demonstrate that StoxLSTM consistently outperforms state-of-the-art baselines, achieving superior performance and generalization.
[366] FedIA: Towards Domain-Robust Aggregation in Federated Graph Learning
Zhanting Zhou, KaHou Tam, Yiding Feng, Ziqiang Zheng, Zeyu Ma, Yang Yang
Main category: cs.LG
TL;DR: FedIA addresses structural orthogonality in federated graph learning by using global importance masking and confidence-aware momentum weighting to reconcile domain-specific update conflicts without extra communication.
Details
Motivation: Federated Graph Learning suffers from cross-silo domain shifts where different domains have distinct graph topologies, causing divergent optimization trajectories and global model divergence. The paper identifies "Structural Orthogonality" where GNN gradients from different domains target disjoint parameter coordinates, leading to "Consensus Collapse" where averaging dilutes informative structural signals.
Method: FedIA is a lightweight server-side framework with two stages: 1) Global Importance Masking (GIM) identifies a shared parameter subspace to filter out domain-specific structural noise and prevent signal dilution; 2) Confidence-Aware Momentum Weighting (CAM) dynamically re-weights client contributions based on gradient reliability to amplify stable optimization signals.
Result: The method addresses the severe architectural pathology of structural orthogonality where GNN updates become near-perpendicular across domains (projection ratios → 0), preventing consensus collapse and enabling better representation of domain-specific structural patterns.
Conclusion: FedIA provides an effective solution to reconcile update conflicts in federated graph learning without requiring auxiliary communication, improving generalization by preventing the dilution of informative structural signals from individual domains.
Abstract: Federated Graph Learning (FGL) enables a central server to coordinate model training across distributed clients without local graph data being shared. However, FGL significantly suffers from cross-silo domain shifts, where each “silo” (domain) contains a limited number of clients with distinct graph topologies. These heterogeneities induce divergent optimization trajectories, ultimately leading to global model divergence. In this work, we reveal a severe architectural pathology termed Structural Orthogonality: the topology-dependent message passing mechanism forces gradients from different domains to target disjoint coordinates in the parameter space. Through a controlled comparison between backbones, we statistically prove that GNN updates are near-perpendicular across domains (with projection ratios → 0). Consequently, naive averaging leads to Consensus Collapse, a phenomenon where sparse, informative structural signals from individual domains are diluted by the near-zero updates of others. This forces the global model into a “sub-optimal” state that fails to represent domain-specific structural patterns, resulting in poor generalization. To address this, we propose FedIA, a lightweight server-side framework designed to reconcile update conflicts without auxiliary communication. FedIA operates in two stages: (1) Global Importance Masking (GIM) identifies a shared parameter subspace to filter out domain-specific structural noise and prevent signal dilution; (2) Confidence-Aware Momentum Weighting (CAM) dynamically re-weights client contributions based on gradient reliability to amplify stable optimization signals.
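A toy illustration of the two phenomena named above, under stated assumptions: the "projection ratio" is taken here to be the absolute cosine similarity between two update vectors (the paper may define it differently), and the four domain updates are hand-built to touch disjoint coordinates. The sketch shows only the pathology, not FedIA's masking or weighting stages.

```python
import numpy as np

def projection_ratio(u, v):
    """Absolute cosine similarity; 0 means the updates are perpendicular."""
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical setup: 4 domains, each updating its own pair of coordinates.
dim, k = 8, 4
updates = np.zeros((k, dim))
for i in range(k):
    updates[i, 2 * i: 2 * i + 2] = 1.0

# Updates from different domains share no coordinates, so they are orthogonal.
ratio_01 = projection_ratio(updates[0], updates[1])

# Naive averaging shrinks every domain's signal by a factor of k,
# because the other k-1 clients contribute zeros at those coordinates.
avg = updates.mean(axis=0)
```

With `k` domains, each informative coordinate keeps only `1/k` of its original magnitude after averaging, which is the dilution effect the abstract calls Consensus Collapse.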
[367] EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning
Songlin Zhao, Michael Pitts, Zhuwei Qin
Main category: cs.LG
TL;DR: EfficientXpert is a lightweight domain pruning framework that converts general pretrained LLMs into sparse, domain-adapted experts with minimal computational overhead, achieving near-dense performance at high sparsity levels.
Details
Motivation: Domain-specific LLM variants are needed for applications like law, healthcare, and finance, but their large scale limits deployment in resource-constrained settings. Existing compression approaches either degrade after domain adaptation or require substantial additional computation.
Method: EfficientXpert integrates two key components: 1) ForeSight Mask - a propagation-aware criterion for selecting weights to prune without backpropagation, and 2) Partial Brain Surgeon - an efficient closed-form update for low-rank adapters under a fixed sparsity pattern. The framework converts models in a single pruning step with fine-tuning cost comparable to standard LoRA.
Result: Across health and legal benchmarks, EfficientXpert reaches up to 98% of dense performance at 40% sparsity, improving over prior pruning baselines while matching LoRA training time and staying within 1% of LoRA peak GPU memory.
Conclusion: EfficientXpert provides an efficient solution for creating domain-specific LLM variants that maintain high performance while being deployable in resource-constrained environments, addressing the computational challenges of domain adaptation and compression.
Abstract: Large language models (LLMs) are increasingly adapted into domain-specific variants for applications in law, healthcare, and finance. Their scale, however, limits deployment in resource-constrained settings, and existing compression approaches often either degrade after domain adaptation or require substantial additional computation. We introduce EfficientXpert, a lightweight framework for domain pruning that integrates ForeSight Mask, a propagation-aware criterion for selecting weights to prune without backpropagation, and Partial Brain Surgeon, an efficient closed-form update for low-rank adapters under a fixed sparsity pattern. With fine-tuning cost comparable to standard LoRA, EfficientXpert converts a general pretrained model into a sparse, domain-adapted expert in a single pruning step. Across health and legal benchmarks, EfficientXpert reaches up to 98 percent of dense performance at 40 percent sparsity, improving over prior pruning baselines while matching LoRA training time and staying within 1 percent of LoRA peak GPU memory in our experiments.
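To illustrate what "40 percent sparsity" under a "fixed sparsity pattern" means in practice, the sketch below builds a binary mask that zeroes the 40% smallest-magnitude weights of a matrix. Plain magnitude pruning is used as a stand-in: it is not the ForeSight Mask criterion, whose details are not given here, and the weight matrix is random.

```python
import numpy as np

def magnitude_mask(w, sparsity):
    """Boolean mask keeping the largest-magnitude weights.
    Assumes no exact ties at the threshold (true for continuous random weights)."""
    k = int(round(sparsity * w.size))  # number of weights to prune
    if k == 0:
        return np.ones(w.shape, dtype=bool)
    magnitudes = np.abs(w).ravel()
    threshold = np.partition(magnitudes, k - 1)[k - 1]  # k-th smallest magnitude
    return np.abs(w) > threshold

rng = np.random.default_rng(0)
w = rng.normal(size=(10, 10))        # hypothetical layer weights
mask = magnitude_mask(w, sparsity=0.4)
w_sparse = w * mask                  # pruned weights are exactly zero
```

Once the mask is fixed, any later adapter update has to respect it, which is the setting the closed-form adapter update above is designed for.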
[368] MCGrad: Multicalibration at Web Scale
Niek Tax, Lorenzo Perini, Fridolin Linder, Daniel Haimovich, Dima Karamshuk, Nastaran Okati, Milan Vojnovic, Pavlos Athanasios Apostolopoulos
Main category: cs.LG
TL;DR: MCGrad is a scalable multicalibration algorithm that doesn’t require manual subgroup specification, is production-ready, and often improves other ML metrics rather than harming them.
Details
Motivation: Existing multicalibration methods have limited industry adoption due to three main issues: they require manual specification of subgroups (which practitioners struggle with), lack scalability, and may harm other important model performance metrics like log loss and PRAUC.
Method: MCGrad is a novel multicalibration algorithm that doesn’t require explicit specification of protected groups, is designed to be scalable, and is implemented to work efficiently in production environments.
Result: MCGrad has been successfully deployed at Meta as part of hundreds of production models, shows positive results from these deployments, performs well on public datasets, and often improves other ML evaluation metrics instead of degrading them.
Conclusion: MCGrad addresses key limitations of existing multicalibration methods, making multicalibration practical for industry adoption by eliminating manual subgroup specification, ensuring scalability, and maintaining or improving overall model performance.
Abstract: We propose MCGrad, a novel and scalable multicalibration algorithm. Multicalibration - calibration in subgroups of the data - is an important property for the performance of machine learning-based systems. Existing multicalibration methods have thus far received limited traction in industry. We argue that this is because existing methods (1) require such subgroups to be manually specified, which ML practitioners often struggle with, (2) are not scalable, or (3) may harm other notions of model performance such as log loss and Area Under the Precision-Recall Curve (PRAUC). MCGrad does not require explicit specification of protected groups, is scalable, and often improves other ML evaluation metrics instead of harming them. MCGrad has been in production at Meta, and is now part of hundreds of production models. We present results from these deployments as well as results on public datasets. We provide an open source implementation of MCGrad at https://github.com/facebookincubator/MCGrad.
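The definition of multicalibration as calibration within subgroups can be illustrated with a small check. The sketch below is not MCGrad; it only measures a simplified calibration gap (mean prediction minus mean outcome) overall and per subgroup, on hand-made data where the subgroups are given explicitly. All names and numbers are illustrative.

```python
import numpy as np

def calibration_gap(p, y):
    """Absolute difference between the mean prediction and the mean outcome."""
    return abs(p.mean() - y.mean())

def subgroup_gaps(p, y, groups):
    """Calibration gap computed separately inside each subgroup."""
    return {str(g): calibration_gap(p[groups == g], y[groups == g])
            for g in np.unique(groups)}

# Hypothetical model that predicts 0.5 for everyone.
p = np.full(8, 0.5)
y = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=float)
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

overall = calibration_gap(p, y)
per_group = subgroup_gaps(p, y, groups)
```

Here the model looks perfectly calibrated overall but is off by 0.25 in each subgroup, which is the kind of failure multicalibration targets.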
[369] Signature-Informed Transformer for Asset Allocation
Yoontae Hwang, Stefan Zohren
Main category: cs.LG
TL;DR: Signature Informed Transformer unifies feature extraction and portfolio optimization in a single policy using path signatures and specialized attention to directly minimize CVaR, outperforming traditional methods.
Details
Motivation: Traditional deep learning for asset allocation separates forecasting from optimization, creating a fundamental mismatch where minimizing prediction errors doesn't yield robust portfolios. This decoupling fails to align the training objective with actual financial goals.
Method: Proposes Signature Informed Transformer that unifies feature extraction and decision making into a single policy. Uses path signatures to encode complex path dependencies and introduces a specialized attention mechanism targeting geometric asset relationships. Directly minimizes Conditional Value at Risk (CVaR) to align training with financial objectives.
Result: The approach significantly outperforms both traditional strategies and advanced forecasting baselines across diverse equity universes. The authors prove that their attention module rigorously amplifies signature-derived signals.
Conclusion: Unifying feature extraction and portfolio optimization in a single policy framework with path signatures and specialized attention mechanisms, while directly optimizing for financial risk metrics like CVaR, leads to superior portfolio performance compared to decoupled forecasting-optimization approaches.
Abstract: Modern deep learning for asset allocation typically separates forecasting from optimization. We argue this creates a fundamental mismatch where minimizing prediction errors fails to yield robust portfolios. We propose the Signature Informed Transformer to address this by unifying feature extraction and decision making into a single policy. Our model employs path signatures to encode complex path dependencies and introduces a specialized attention mechanism that targets geometric asset relationships. By directly minimizing the Conditional Value at Risk, we ensure the training objective aligns with financial goals. We prove that our attention module rigorously amplifies signature-derived signals. Experiments across diverse equity universes show our approach significantly outperforms both traditional strategies and advanced forecasting baselines. The code is available at: https://anonymous.4open.science/r/Signature-Informed-Transformer-For-Asset-Allocation-DB88
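For readers unfamiliar with the training objective, the sketch below computes an empirical Conditional Value at Risk: the average of the losses at or beyond the alpha-quantile. It is a plain NumPy estimator on a made-up loss sample, not the paper's differentiable training loss; the function name and the default `alpha` are illustrative choices.

```python
import numpy as np

def cvar(losses, alpha=0.95):
    """Empirical CVaR: mean of the losses at or above the alpha-quantile (VaR)."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)   # Value at Risk at level alpha
    tail = losses[losses >= var]       # the worst outcomes
    return tail.mean()

# Hypothetical portfolio losses: 1, 2, ..., 100.
losses = np.arange(1, 101)
tail_risk = cvar(losses, alpha=0.95)
average_loss = losses.mean()
```

Because CVaR averages only the worst outcomes, it is never smaller than the mean loss, so minimizing it pushes a policy to limit tail losses instead of average error.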
[370] Spectral Generative Flow Models: A Physics-Inspired Replacement for Vectorized Large Language Models
Andrew Kiruluta
Main category: cs.LG
TL;DR: Spectral Generative Flow Models (SGFMs) are physics-inspired generative models that treat text/video as continuous field evolution using stochastic dynamics in wavelet basis, replacing transformers with local operators and spectral projections.
Details
Motivation: To provide a principled alternative to transformer-based LLMs and diffusion models by grounding generation in physics-inspired continuous field theory, enabling better long-range coherence, multimodal generality, and physically structured inductive biases.
Method: Treats text/video as trajectories of stochastic PDEs, uses wavelet-domain representation for sparsity and scale separation, implements constrained stochastic flow with local operators, spectral projections, and Navier-Stokes-like transport instead of global attention.
Result: Proposes a novel generative architecture that fundamentally departs from autoregressive and diffusion-based approaches, offering theoretical framework for continuous field-based generation with computational efficiency.
Conclusion: SGFMs provide a physics-inspired path toward next-generation generative models with improved coherence, multimodal capabilities, and physically structured inductive biases through continuous field theory and wavelet representations.
Abstract: We introduce Spectral Generative Flow Models (SGFMs), a physics-inspired alternative to transformer-based large language models. Instead of representing text or video as sequences of discrete tokens processed by attention, SGFMs treat generation as the evolution of a continuous field governed by constrained stochastic dynamics in a multiscale wavelet basis. This formulation replaces global attention with local operators, spectral projections, and Navier–Stokes-like transport, yielding a generative mechanism grounded in continuity, geometry, and physical structure. Our framework provides three key innovations: (i) a field-theoretic ontology in which text and video are unified as trajectories of a stochastic partial differential equation; (ii) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and (iii) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together, these components define a generative architecture that departs fundamentally from autoregressive modeling and diffusion-based approaches. SGFMs offer a principled path toward long-range coherence, multimodal generality, and physically structured inductive bias in next-generation generative models.
[371] Distributionally Robust Causal Abstractions
Yorgos Felekis, Theodoros Damoulas, Paris Giampouras
Main category: cs.LG
TL;DR: First distributionally robust causal abstraction learning framework with Wasserstein ambiguity sets to handle environmental shifts and model misspecification.
Details
Motivation: Existing causal abstraction learning methods assume fixed, well-specified exogenous distributions, making them vulnerable to environmental shifts and model misspecification.
Method: Introduces distributionally robust causal abstractions with constrained min-max optimization using Wasserstein ambiguity sets. Provides algorithms with theoretical guarantees for empirical and Gaussian environments.
Result: Theoretical guarantees for ambiguity set radius selection and worst-case abstraction error bounds. Empirical evidence shows robustness to environmental shifts, structural misspecification, and intervention mapping misspecification.
Conclusion: First robust causal abstraction learning framework that addresses distributional uncertainty and model misspecification through distributionally robust optimization with Wasserstein ambiguity sets.
Abstract: Causal Abstraction (CA) theory provides a principled framework for relating causal models that describe the same system at different levels of granularity while ensuring interventional consistency between them. Recent methods for learning CAs, however, assume fixed and well-specified exogenous distributions, leaving them vulnerable to environmental shifts and model misspecification. In this work, we address these limitations by introducing the first class of distributionally robust CAs and their associated learning algorithms. The latter cast robust causal abstraction learning as a constrained min-max optimization problem with Wasserstein ambiguity sets. We provide theoretical guarantees for both empirical and Gaussian environments, enabling principled selection of ambiguity set radii and establish quantitative guarantees on worst-case abstraction error. Furthermore, we present empirical evidence across different problems and CA learning methods, demonstrating our framework’s robustness not only to environmental shifts but also to structural and intervention mapping misspecification.
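As a reading aid, a generic Wasserstein distributionally robust objective has the following shape. This is the textbook template, not the paper's exact formulation: the symbols are assumptions, with $\tau$ standing for the abstraction being learned, $\hat{P}$ for the nominal exogenous distribution, $\varepsilon$ for the ambiguity radius, $W_p$ for the order-$p$ Wasserstein distance, and $e(\tau; u)$ for an abstraction error evaluated at exogenous sample $u$. Any additional constraints on $\tau$ are omitted.

```latex
\min_{\tau} \; \sup_{Q \,:\, W_p(Q, \hat{P}) \le \varepsilon} \; \mathbb{E}_{u \sim Q}\big[ e(\tau; u) \big]
```

The inner supremum picks the worst distribution within distance $\varepsilon$ of the nominal one, so a larger radius buys more robustness at the cost of a more conservative abstraction; the theoretical guarantees above concern how to choose that radius.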
[372] Who Benefits From Sinus Surgery? Comparing Generative AI and Supervised Machine Learning for Predicting Surgical Outcomes in Chronic Rhinosinusitis
Sayeed Shafayet Chowdhury, Snehasis Mukhopadhyay, Shiaofen Fang, Vijay R. Ramakrishnan
Main category: cs.LG
TL;DR: Supervised ML models outperform zero-shot generative AI at predicting surgical outcomes in chronic rhinosinusitis; the best model, an MLP, reaches 85% accuracy, while the GenAI models lag on discrimination and calibration.
Details
Motivation: Despite AI advancements in medical imaging, there's limited use of AI for prospective clinical decision support. The study aims to predict which chronic rhinosinusitis patients would benefit from surgery using only pre-operative data to identify those who should avoid surgery.
Method: Benchmarked supervised ML (logistic regression, tree ensembles, MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity) using structured pre-operative clinical data. Used prospectively collected cohort where all patients underwent surgery, with success defined as >8.9-point SNOT-22 reduction at 6 months.
Result: Best ML model (MLP) achieved 85% accuracy with superior calibration and decision-curve net benefit. GenAI models underperformed on discrimination and calibration in zero-shot setting. GenAI justifications aligned with clinician heuristics and MLP feature importance, highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and psychology/pain comorbidities.
Conclusion: Supports ML-first, GenAI-augmented workflow: deploy calibrated ML for primary surgical candidacy triage, with GenAI as explainer to enhance transparency and shared decision-making. Provides reproducible tabular-to-GenAI evaluation protocol.
Abstract: Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a more than 8.9-point reduction in SNOT-22 at 6 months (MCID). In a prospectively collected cohort where all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes, i.e. those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85% accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration in the zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP’s feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and psychology/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI-augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.
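The outcome definition and the decision-curve metric mentioned above can be sketched as follows. The 8.9-point threshold comes from the abstract; the net-benefit formula (true positives per patient minus false positives per patient, weighted by the odds of the threshold probability) is the standard decision-curve definition. The function names and the toy labels and probabilities are assumptions, not the study's data.

```python
MCID = 8.9  # SNOT-22 points; success threshold stated in the abstract

def is_success(pre_snot22, post_snot22):
    """Success means the SNOT-22 score dropped by more than the MCID."""
    return (pre_snot22 - post_snot22) > MCID

def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit: TP/n - FP/n * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and not y)
    return tp / n - fp / n * threshold / (1 - threshold)

# Toy example (hypothetical patients): 1 = benefited from surgery.
labels = [1, 1, 0, 1, 0]
probs = [0.9, 0.7, 0.6, 0.4, 0.2]
print(net_benefit(labels, probs, threshold=0.5))  # 2/5 - 1/5 * 1, about 0.2
```

Sweeping `threshold` over a range of values and plotting the result gives the decision curve on which models are compared.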
[373] Robust Reinforcement Learning in Finance: Modeling Market Impact with Elliptic Uncertainty Sets
Shaocong Ma, Heng Huang
Main category: cs.LG
TL;DR: Novel robust RL approach using elliptic uncertainty sets to handle directional market impact in financial trading, with closed-form solutions for worst-case uncertainty.
Details
Motivation: RL agents trained on historical data face performance degradation during live deployment due to market impact - their own trades shifting prices. Traditional robust RL uses symmetric uncertainty structures that don't capture the directional nature of market impact.
Method: Developed a novel class of elliptic uncertainty sets to model directional market impact. Established both implicit and explicit closed-form solutions for worst-case uncertainty under these sets, enabling efficient robust policy evaluation.
Result: Experiments on single-asset and multi-asset trading tasks show superior Sharpe ratio and robustness under increasing trade volumes compared to traditional approaches.
Conclusion: The method offers a more faithful and scalable approach to RL in financial markets by properly addressing directional market impact through elliptic uncertainty sets with tractable solutions.
Abstract: In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. However, during deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. This mismatch between training and deployment environments can significantly degrade performance. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties, but typically rely on symmetric structures that fail to capture the directional nature of market impact. To address this issue, we develop a novel class of elliptic uncertainty sets. We establish both implicit and explicit closed-form solutions for the worst-case uncertainty under these sets, enabling efficient and tractable robust policy evaluation. Experiments on single-asset and multi-asset trading tasks demonstrate that our method achieves superior Sharpe ratio and remains robust under increasing trade volumes, offering a more faithful and scalable approach to RL in financial markets.
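As a rough illustration of why elliptic sets admit closed-form worst cases, the sketch below evaluates the minimum of a linear objective g.u over a generic ellipsoid {u : (u - c)^T A^{-1} (u - c) <= 1}, which is g.c - sqrt(g^T A g). This is a textbook result for ellipsoids; the paper's specific uncertainty sets and solutions may differ, and all names and numbers here are assumptions. An off-center ellipsoid (c not at zero) is one simple way to express a directional bias.

```python
import math

def worst_case_linear(g, center, shape):
    """Minimum of g.u over the ellipsoid with the given center and SPD shape matrix."""
    n = len(g)
    g_dot_c = sum(g[i] * center[i] for i in range(n))
    quad = sum(g[i] * shape[i][j] * g[j] for i in range(n) for j in range(n))
    return g_dot_c - math.sqrt(quad)

# Hypothetical 2-asset example: the price perturbation is biased against the
# trade direction (center shifted to -0.2 on the first axis).
g = [1.0, 0.0]
center = [-0.2, 0.0]
shape = [[0.09, 0.0], [0.0, 0.01]]
print(worst_case_linear(g, center, shape))  # about -0.5
```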
[374] PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Ai Jian, Jingqing Ruan, Xing Ma, Dailin Li, Weipeng Zhang, Ke Zeng, Xunliang Cai
Main category: cs.LG
TL;DR: PaTaRM is a novel reward model that enables pointwise training using pairwise data via Preference-Aware Reward mechanism and Task-Adaptive Rubric system, improving both reward modeling and downstream RLHF performance.
Details
Motivation: Generative reward models offer better interpretability than scalar reward models, but face a trade-off: pairwise methods suffer from training-inference mismatch, while pointwise methods require expensive absolute annotations. There's a need to bridge this gap using readily available pairwise data.
Method: PaTaRM introduces two key components: 1) Preference-Aware Reward (PAR) mechanism that enables robust pointwise training using pairwise data without explicit rating labels, and 2) Task-Adaptive Rubric system that dynamically generates instance-specific evaluation criteria for precise assessment.
Result: PaTaRM achieves 8.7% average improvement on RewardBench and RMBench across Qwen3-8B/14B models. More importantly, it boosts downstream RLHF performance by 13.6% average relative improvement across IFEval and InFoBench benchmarks.
Conclusion: PaTaRM successfully bridges the gap between pairwise and pointwise reward modeling by enabling pointwise training with pairwise data, providing both interpretability and improved alignment performance for RLHF applications.
Abstract: Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. Generative reward models (GRMs) provide greater interpretability than traditional scalar RMs, but they come with a critical trade-off: pairwise methods are hindered by a training-inference mismatch, while pointwise methods require expensive absolute annotations. To bridge this gap, we propose the Preference-aware Task-adaptive Reward Model (PaTaRM). Unlike prior approaches, PaTaRM enables robust pointwise training using readily available pairwise data via a novel Preference-Aware Reward (PAR) mechanism, eliminating the need for explicit rating labels. Furthermore, it incorporates a Task-Adaptive Rubric system that dynamically generates instance-specific criteria for precise evaluation. Extensive experiments demonstrate that PaTaRM achieves an 8.7% average improvement on RewardBench and RMBench across Qwen3-8B/14B models. Crucially, it boosts downstream RLHF performance by an average relative improvement of 13.6% across IFEval and InFoBench, validating its effectiveness for policy alignment. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
[375] Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models
Injin Kong, Hyoungjoon Lee, Yohan Jo
Main category: cs.LG
TL;DR: Post-training ARMs into MDMs causes fundamental reorganization of internal computation, not just parameter adaptation, enabling non-sequential global planning.
Details
Motivation: To understand whether post-trained MDMs acquire genuine bidirectional reasoning capabilities or merely repackage autoregressive heuristics, and to explore the internal algorithmic transformations induced by this paradigm shift.
Method: Comparative circuit analysis of ARMs and their MDM counterparts, examining both structural and semantic changes in model internals.
Result: MDMs largely retain autoregressive circuitry for local causal tasks but abandon initialized pathways for global planning tasks, showing distinct rewiring with increased early-layer processing. Semantically, there’s a transition from sharp, localized specialization in ARMs to distributed integration in MDMs.
Conclusion: Diffusion post-training fundamentally reorganizes internal computation to support non-sequential global planning, rather than merely adapting model parameters.
Abstract: Post-training pretrained Autoregressive models (ARMs) into Masked Diffusion models (MDMs) has emerged as a cost-effective strategy to overcome the limitations of sequential generation. However, the internal algorithmic transformations induced by this paradigm shift remain unexplored, leaving it unclear whether post-trained MDMs acquire genuine bidirectional reasoning capabilities or merely repackage autoregressive heuristics. In this work, we address this question by conducting a comparative circuit analysis of ARMs and their MDM counterparts. Our analysis reveals a systematic “mechanism shift” dependent on the structural nature of the task. Structurally, we observe a distinct divergence: while MDMs largely retain autoregressive circuitry for tasks dominated by local causal dependencies, they abandon initialized pathways for global planning tasks, exhibiting distinct rewiring characterized by increased early-layer processing. Semantically, we identify a transition from sharp, localized specialization in ARMs to distributed integration in MDMs. Through these findings, we conclude that diffusion post-training does not merely adapt model parameters but fundamentally reorganizes internal computation to support non-sequential global planning.
[376] One Router to Route Them All: Homogeneous Expert Routing for Heterogeneous Graph Transformers
Georgiy Shakirov, Albert Arakelov
Main category: cs.LG
TL;DR: HER integrates Mixture-of-Experts into Heterogeneous Graph Transformers with type-agnostic routing, outperforming type-specific approaches and enabling semantic specialization.
Details
Motivation: Current HGNNs condition parameters on node/edge types, causing overreliance on surface-level labels and impeding cross-type knowledge transfer. MoE's success in homogeneous settings suggests potential for heterogeneous graphs, but the need for type-specific experts is questionable.
Method: Proposes Homogeneous Expert Routing (HER), an MoE layer for Heterogeneous Graph Transformers that stochastically masks type embeddings during routing to encourage type-agnostic specialization. This regularization prevents experts from relying on type information.
Result: HER consistently outperforms standard HGT and type-separated MoE baselines on IMDB, ACM, and DBLP datasets for link prediction. Analysis shows HER experts specialize by semantic patterns (e.g., movie genres) rather than node types, confirming routing is driven by latent semantics.
Conclusion: Regularizing type dependence in expert routing yields more generalizable, efficient, and interpretable representations. This establishes a new design principle for heterogeneous graph learning: encouraging type-agnostic specialization through stochastic type masking.
Abstract: A common practice in heterogeneous graph neural networks (HGNNs) is to condition parameters on node/edge types, assuming types reflect semantic roles. However, this can cause overreliance on surface-level labels and impede cross-type knowledge transfer. We explore integrating Mixture-of-Experts (MoE) into HGNNs–a direction underexplored despite MoE’s success in homogeneous settings. Crucially, we question the need for type-specific experts. We propose Homogeneous Expert Routing (HER), an MoE layer for Heterogeneous Graph Transformers (HGT) that stochastically masks type embeddings during routing to encourage type-agnostic specialization. Evaluated on IMDB, ACM, and DBLP for link prediction, HER consistently outperforms standard HGT and a type-separated MoE baseline. Analysis on IMDB shows HER experts specialize by semantic patterns (e.g., movie genres) rather than node types, confirming routing is driven by latent semantics. Our work demonstrates that regularizing type dependence in expert routing yields more generalizable, efficient, and interpretable representations–a new design principle for heterogeneous graph learning.
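The idea of stochastically masking type embeddings during routing can be sketched as below. This is a minimal illustration under several assumptions that are not taken from the paper: the type embedding is added to the node feature, the whole embedding is dropped with probability `mask_prob` during training, and the router is a single linear layer followed by a softmax.

```python
import math
import random

def route(node_feat, type_emb, router_weights, mask_prob, rng, training=True):
    """Return softmax routing probabilities over experts for one node."""
    # With probability mask_prob, hide the type embedding from the router.
    masked = training and rng.random() < mask_prob
    router_in = node_feat if masked else [f + t for f, t in zip(node_feat, type_emb)]
    logits = [sum(w * x for w, x in zip(row, router_in)) for row in router_weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-d features and 3 experts.
rng = random.Random(0)
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(route([0.2, 0.4], [1.0, -1.0], weights, mask_prob=0.5, rng=rng))
```

When the mask fires, two nodes with the same features but different types receive identical routing probabilities, which is what pushes experts toward type-agnostic specialization.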
[377] Integrating Neural Differential Forecasting with Safe Reinforcement Learning for Blood Glucose Regulation
Yushen Liu, Yanfu Zhang, Xugui Zhou
Main category: cs.LG
TL;DR: TSODE integrates Thompson Sampling RL with NeuralODE forecasting and conformal calibration for safe, personalized insulin delivery in Type 1 Diabetes, achieving 87.9% time-in-range with less than 10% of time below 70 mg/dL in simulation.
Details
Motivation: Existing RL approaches for automated insulin delivery struggle to simultaneously guarantee safety while achieving personalized glucose control, risking issues like overdosing before meals or stacking corrections.
Method: TSODE combines Thompson Sampling RL with a NeuralODE forecaster that predicts short-term glucose trajectories conditioned on insulin doses, plus a conformal calibration layer that quantifies predictive uncertainty to reject or scale risky actions.
Result: In FDA-approved UVa/Padova simulator (adult cohort), TSODE achieved 87.9% time-in-range with less than 10% time below 70 mg/dL, outperforming relevant baselines.
Conclusion: Integrating adaptive RL with calibrated NeuralODE forecasting enables interpretable, safe, and robust glucose regulation for Type 1 Diabetes management.
Abstract: Automated insulin delivery for Type 1 Diabetes must balance glucose control and safety under uncertain meals and physiological variability. While reinforcement learning (RL) enables adaptive personalization, existing approaches struggle to simultaneously guarantee safety and can take risky actions such as overdosing before meals or stacking corrections, leaving a gap in achieving glucose control that is both personalized and risk-aware. To bridge this gap, we propose TSODE, a safety-aware controller that integrates Thompson Sampling RL with a Neural Ordinary Differential Equation (NeuralODE) forecaster. Specifically, the NeuralODE predicts short-term glucose trajectories conditioned on proposed insulin doses, while a conformal calibration layer quantifies predictive uncertainty to reject or scale risky actions. In the FDA-approved UVa/Padova simulator (adult cohort), TSODE achieved 87.9% time-in-range with less than 10% time below 70 mg/dL, outperforming relevant baselines. These results demonstrate that integrating adaptive RL with calibrated NeuralODE forecasting enables interpretable, safe, and robust glucose regulation.
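The "reject or scale risky actions" step can be pictured as a small safety shield around the RL policy. The sketch below is an assumption-laden toy: the linear forecaster stands in for the NeuralODE, `margin` stands in for a calibrated uncertainty half-width, and the halving schedule is invented for illustration. Only the 70 mg/dL threshold comes from the abstract.

```python
HYPO_THRESHOLD = 70.0  # mg/dL, hypoglycemia threshold used in the abstract

def toy_forecast_min_glucose(current_glucose, dose_units, sensitivity=30.0):
    """Hypothetical stand-in for the forecaster: lowest predicted glucose."""
    return current_glucose - sensitivity * dose_units

def shield_dose(current_glucose, proposed_dose, margin,
                forecast=toy_forecast_min_glucose, shrink=0.5, max_steps=4):
    """Scale the dose down until the pessimistic forecast is safe, else reject."""
    dose = proposed_dose
    for _ in range(max_steps):
        lower_bound = forecast(current_glucose, dose) - margin
        if lower_bound >= HYPO_THRESHOLD:
            return dose      # safe under the uncertainty margin
        dose *= shrink       # scale the risky action down and retry
    return 0.0               # reject: no scaled dose was safe

print(shield_dose(150.0, 3.0, margin=15.0))  # 1.5 (scaled once)
print(shield_dose(80.0, 1.0, margin=15.0))   # 0.0 (rejected)
```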
[378] Radiation-Preserving Selective Imaging for Pediatric Hip Dysplasia: A Cross-Modal Ultrasound-Xray Policy with Limited Labels
Duncan Stothers, Ben Stothers, Emily Schaeffer, Kishore Mulpuri
Main category: cs.LG
TL;DR: A deep learning pipeline for DDH screening that uses ultrasound-first with conformal deferral rules to reduce unnecessary radiographs while maintaining diagnostic coverage guarantees.
Details
Motivation: To develop a radiation-preserving policy for developmental dysplasia of the hip (DDH) that minimizes unnecessary radiographs while maintaining diagnostic accuracy, using ultrasound as the primary screening modality.
Method: Three-step approach: (1) Pretrain modality-specific encoders (ResNet-18) with SimSiam on large unlabeled ultrasound and radiograph datasets, (2) Freeze backbones and train small measurement-specific heads on DDH landmarks, (3) Calibrate one-sided conformal deferral rules on ultrasound predictions with finite sample coverage guarantees.
Result: Ultrasound measurement errors: alpha MAE ~9.7°, femoral head coverage MAE ~14.0%. Radiographic measurement errors: acetabular index (AI) MAE ~7.6°, center-edge (CE) angle MAE ~8.9°. Calibrated policies achieve tunable trade-offs between US-only throughput and diagnostic coverage.
Conclusion: The pipeline provides a simple, reproducible method to convert limited labels into interpretable measurements with tunable selective imaging policies suitable for clinical implementation and future validation.
Abstract: We study an ultrasound-first, radiation-preserving policy for developmental dysplasia of the hip (DDH) that requests a radiograph only when needed. We (i) pretrain modality-specific encoders (ResNet-18) with SimSiam on a large unlabelled registry (37186 ultrasound; 19546 radiographs), (ii) freeze the backbones and fit small, measurement-faithful heads on DDH-relevant landmarks and measurements, and (iii) calibrate a one-sided conformal deferral rule on ultrasound predictions that provides finite-sample marginal coverage guarantees under exchangeability, using a held-out calibration set. Ultrasound heads predict Graf alpha, beta, and femoral head coverage; X-ray heads predict acetabular index (AI), center-edge (CE) angle and IHDI grade. On our held-out labeled evaluation set, ultrasound measurement error is modest (e.g., alpha MAE ~= 9.7 degrees, coverage MAE ~= 14.0%), while radiographic probes achieve AI and CE MAEs of ~= 7.6 degrees and ~= 8.9 degrees, respectively. The calibrated US-only policy is explored across rule families (alpha-only; alpha OR coverage; alpha AND coverage), conformal miscoverage levels, and per-utility trade-offs using decision-curve analysis. Conservative settings yield high coverage with near-zero US-only rates; permissive settings (e.g., alpha OR coverage at larger deltas) achieve non-zero US-only throughput with expected coverage tradeoffs. The result is a simple, reproducible pipeline that turns limited labels into interpretable measurements and tunable selective imaging curves suitable for clinical handoff and future external validation.
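A one-sided split-conformal deferral rule of the kind described in step (iii) might look like the sketch below. The scores are overestimation errors (prediction minus truth) on a held-out calibration set, and the finite-sample quantile uses the usual ceil((n + 1)(1 - delta)) rank. The alpha-only rule, the 60-degree cutoff, and the calibration numbers are assumptions for illustration and are not taken from the paper.

```python
import math

def conformal_quantile(scores, delta):
    """Finite-sample (1 - delta) quantile for split conformal prediction."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - delta))
    if k > n:
        return float("inf")  # too few calibration points for this delta
    return sorted(scores)[k - 1]

def ultrasound_only(alpha_pred, q, normal_alpha=60.0):
    """Skip the radiograph only if the conformal lower bound on alpha is normal."""
    return alpha_pred - q >= normal_alpha

# Hypothetical calibration set: predicted vs. true Graf alpha angles (degrees).
cal_pred = [60, 62, 58, 66, 70, 55, 64, 61, 68]
cal_true = [63, 63, 58, 64, 66, 50, 57, 53, 58]
scores = [p - t for p, t in zip(cal_pred, cal_true)]  # overestimation errors
q = conformal_quantile(scores, delta=0.2)
print(q)                         # 8
print(ultrasound_only(70.0, q))  # True: lower bound 62 is at least 60
print(ultrasound_only(65.0, q))  # False: defer to a radiograph
```

Under exchangeability, the true alpha is at least `alpha_pred - q` with probability at least 1 - delta, which is the marginal coverage guarantee the rule relies on. Smaller delta gives a larger q and therefore fewer ultrasound-only decisions.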
[379] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
Main category: cs.LG
TL;DR: TLT accelerates reasoning RL training for LLMs using adaptive speculative decoding to overcome efficiency bottlenecks from long-tail response generation.
Details
Motivation: Training reasoning LLMs with RL suffers from efficiency bottlenecks due to long-tail response generation where few very long responses dominate execution time, wasting resources and increasing costs.
Method: TLT integrates adaptive speculative decoding with two components: (1) Adaptive Drafter - lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with target model; (2) Adaptive Rollout Engine - maintains memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable speculative decoding strategies for each input batch.
Result: TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems while preserving model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment.
Conclusion: TLT successfully accelerates reasoning RL training losslessly through adaptive speculative decoding, overcoming challenges of dynamic workloads, evolving target models, and draft model training overhead.
Abstract: The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
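Why speculative decoding can be lossless is easiest to see from the standard single-token acceptance rule of speculative sampling: accept the drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. The sketch below implements only that generic rule on toy distributions. It is not TLT's rollout engine, and the function name and distributions are assumptions.

```python
import random

def speculative_accept(draft_token, p_target, q_draft, rng):
    """One speculative-sampling step; returns the token that is emitted."""
    p, q = p_target[draft_token], q_draft[draft_token]
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p / q):
        return draft_token
    # On rejection, resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual)[0]

# Toy check: emitted tokens follow the target distribution, not the draft one.
rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]
q_draft = [0.2, 0.5, 0.3]
counts = [0, 0, 0]
for _ in range(20000):
    drafted = rng.choices(range(3), weights=q_draft)[0]
    counts[speculative_accept(drafted, p_target, q_draft, rng)] += 1
print([c / 20000 for c in counts])  # close to p_target
```

The closer the draft distribution is to the target, the higher the acceptance rate, which is why keeping the drafter aligned with an evolving target model matters for speed.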
[380] Boundary-Aware Adversarial Filtering for Reliable Diagnosis under Extreme Class Imbalance
Yanxuan Yu, Michael S. Hughes, Julien Lee, Jiacheng Zhou, Andrew F. Laine
Main category: cs.LG
TL;DR: AF-SMOTE: An adversarial filtering framework for imbalanced classification that synthesizes minority points then filters them to improve recall and calibration, validated on medical diagnosis and fraud detection tasks.
Details
Motivation: Address classification under extreme class imbalance where both recall and calibration are critical, particularly in medical diagnosis scenarios where missing true positive cases in rare diseases can have severe consequences.
Method: Proposes AF-SMOTE: 1) synthesizes minority points using SMOTE-like augmentation, 2) filters synthesized points using an adversarial discriminator and boundary utility model to select only beneficial samples.
Result: Proves theoretical guarantees: filtering monotonically improves F_beta surrogate (beta >= 1) without inflating Brier score. Outperforms SMOTE, ADASYN, Borderline-SMOTE, SVM-SMOTE on MIMIC-IV and fraud detection benchmarks with higher recall, average precision, and best calibration.
Conclusion: AF-SMOTE effectively addresses extreme class imbalance with theoretical guarantees and practical validation, demonstrating value in clinical applications through successful use with proxy labels on healthcare data.
Abstract: We study classification under extreme class imbalance where recall and calibration are both critical, for example in medical diagnosis scenarios. We propose AF-SMOTE, a mathematically motivated augmentation framework that first synthesizes minority points and then filters them by an adversarial discriminator and a boundary utility model. We prove that, under mild assumptions on the decision boundary smoothness and class-conditional densities, our filtering step monotonically improves a surrogate of F_beta (for beta >= 1) while not inflating Brier score. On MIMIC-IV proxy label prediction and canonical fraud detection benchmarks, AF-SMOTE attains higher recall and average precision than strong oversampling baselines (SMOTE, ADASYN, Borderline-SMOTE, SVM-SMOTE), and yields the best calibration. We further validate these gains across multiple additional datasets beyond MIMIC-IV. Our successful application of AF-SMOTE to a healthcare dataset using a proxy label demonstrates in a disease-agnostic way its practical value in clinical situations, where missing true positive cases in rare diseases can have severe consequences.
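The synthesize-then-filter pattern can be sketched as below. The interpolation step is a simplified SMOTE-like scheme (random pairs of minority points rather than k-nearest neighbours), and both scoring functions are hand-written stand-ins for the adversarial discriminator and the boundary utility model, whose actual form is not given here. All names, thresholds, and data are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, rng):
    """Create synthetic points by interpolating between random minority pairs."""
    synth = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synth.append([x + t * (y - x) for x, y in zip(a, b)])
    return synth

def filter_synthetic(points, realism, utility, min_realism, min_utility):
    """Keep only points that pass both the realism and the utility score."""
    return [p for p in points
            if realism(p) >= min_realism and utility(p) >= min_utility]

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)
candidates = smote_like(minority, 20, rng)

def realism(p):            # stand-in discriminator: closeness to real minority data
    return -min(math.dist(p, m) for m in minority)

def boundary_utility(p):   # stand-in utility: closeness to a hypothetical boundary x0 = 0.5
    return -abs(p[0] - 0.5)

kept = filter_synthetic(candidates, realism, boundary_utility, -0.6, -0.3)
print(len(candidates), len(kept))
```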
[381] Interpretable Air Pollution Forecasting by Physics-Guided Spatiotemporal Decoupling
Zhiguo Zhang, Xiaoliang Ma, Daniel Schlesinger
Main category: cs.LG
TL;DR: A physics-guided interpretable spatiotemporal learning framework for air pollution forecasting that decomposes pollutant behavior into transparent additive modules, outperforming state-of-the-art baselines while maintaining interpretability.
Details
Motivation: There's a trade-off between performance and interpretability in current air pollution forecasting models, but accurate and interpretable forecasting is crucial for public health and operational air-quality management.
Method: Proposes a physics-guided, interpretable-by-design framework that decomposes spatiotemporal behavior into two additive modules: 1) physics-guided transport kernel with directed weights conditioned on wind and geography (advection), and 2) explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers.
Result: The model consistently outperforms state-of-the-art baselines across multiple forecasting horizons when evaluated on a comprehensive dataset from the Stockholm region.
Conclusion: The integration of high predictive performance with spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.
Abstract: Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model’s integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.
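One way to picture "directed weights conditioned on wind and geography" plus an additive local term is the toy sketch below. The weighting rule (alignment of the wind with the source-to-target direction, divided by distance) is an assumption chosen for readability, not the paper's kernel, and the local term is passed in as a plain number rather than produced by an attention module.

```python
import math

def transport_weights(target_xy, source_xys, wind_uv):
    """Normalized directed weights: only upwind sources contribute."""
    speed = math.hypot(*wind_uv)
    raw = []
    for sx, sy in source_xys:
        dx, dy = target_xy[0] - sx, target_xy[1] - sy
        dist = math.hypot(dx, dy)
        if dist == 0 or speed == 0:
            raw.append(0.0)
            continue
        # Cosine between the wind and the direction from source to target.
        alignment = (dx * wind_uv[0] + dy * wind_uv[1]) / (dist * speed)
        raw.append(max(0.0, alignment) / dist)
    total = sum(raw)
    return [r / total if total > 0 else 0.0 for r in raw]

def additive_forecast(local_term, weights, source_conc):
    """Forecast = local response + wind-driven transport from other stations."""
    return local_term + sum(w * c for w, c in zip(weights, source_conc))

# Wind blows along +x: the station at (-1, 0) is upwind of the target at (0, 0).
w = transport_weights((0.0, 0.0), [(-1.0, 0.0), (1.0, 0.0)], (1.0, 0.0))
print(w)                                        # [1.0, 0.0]
print(additive_forecast(2.0, w, [10.0, 40.0]))  # 12.0
```

Because the two terms are added, each forecast can be read as "this much from local history, this much transported from each upwind station".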
[382] Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment
Libo Wang
Main category: cs.LG
TL;DR: Sigma is a vision-language-action model that integrates semantic understanding with associative reasoning to enable telepathic-style alignment between perception and action, achieving improved control performance without retraining the base model.
Details
Motivation: The paper addresses a fundamental limitation in cognitive systems: the absence of a time-updatable mediating thought space between semantics and continuous control. Current systems lack proper integration between deep semantic understanding and associative reasoning for perception-action alignment.
Method: Built Sigma model on pi0.5_base backbone using svla_so101_pickplace dataset. Introduced independently designed VLA architecture integrating semantic understanding with associative reasoning. Used iterative optimization of data preprocessing, LoRA-based fine-tuning, and inference-stage adapter design.
Result: Sigma showed consistent reduction in control MSE across vector-, fragment-, and trajectory-level scales compared to untuned pi0.5_base. Preserved stability of telepathy norm and semantic-text alignment quality. Achieved mind-responsive alignment control without retraining base model.
Conclusion: Mind-responsive alignment control can be quantitatively achieved through semantic and associative architectural integration without retraining the base model. Provides reproducible pathway for semantic alignment and intention-driven behavior in cognitive systems.
Abstract: To address a fundamental limitation in cognitive systems, namely the absence of a time-updatable mediating thought space between semantics and continuous control, this work constructs and trains a vision-language-action model termed Sigma, deployed on a single RTX 4090. The model is built upon the open-source pi0.5_base backbone, with the svla_so101_pickplace dataset preprocessed into a structured training corpus. An independently designed VLA architecture is introduced to integrate deep semantic understanding with associative reasoning, enabling telepathic-style alignment between perception and action. Training proceeds through iterative optimization of data preprocessing, LoRA-based fine-tuning, and inference-stage adapter design. Evaluation is conducted using offline closed-loop replay, comparing Sigma against the untuned pi0.5_base under identical data conditions. Experimental results indicate a consistent reduction in control MSE across vector-, fragment-, and trajectory-level scales, while preserving the stability of the telepathy norm and semantic-text alignment quality. These findings demonstrate that mind-responsive alignment control can be quantitatively achieved through semantic and associative architectural integration without retraining the base model, providing a reproducible pathway for semantic alignment and intention-driven behavior.
[383] DS FedProxGrad: Asymptotic Stationarity Without Noise Floor in Fair Federated Learning
Huzaifa Arif
Main category: cs.LG
TL;DR: Improved asymptotic convergence analysis for FedProxGrad in group fair federated learning, showing convergence to exact stationarity without variance-induced noise floor.
Details
Motivation: Previous FedProxGrad analysis only showed convergence to a noise-dominated neighborhood with explicit dependence on variance-induced noise floor, which is suboptimal for non-convex composite optimization in group fair federated learning.
Method: Extended FedProxGrad framework called DS-FedProxGrad (Decay Step Size FedProxGrad) with Robbins-Monro step-size schedule and mild decay condition on local inexactness for solving non-convex composite optimization with explicit fairness regularization.
Result: Proved that liminf_{r→∞} 𝔼[‖∇F(x^r)‖²] = 0, meaning algorithm is asymptotically stationary and convergence rate does not depend on variance-induced noise floor.
Conclusion: DS-FedProxGrad provides improved asymptotic convergence guarantees for group fair federated learning, eliminating the noise floor limitation of previous analysis and ensuring convergence to exact stationarity.
Abstract: Recent work \cite{arifgroup} introduced Federated Proximal Gradient (FedProxGrad) for solving non-convex composite optimization problems in group fair federated learning. However, the original analysis established convergence only to a noise-dominated neighborhood of stationarity, with explicit dependence on a variance-induced noise floor. In this work, we provide an improved asymptotic convergence analysis for a generalized FedProxGrad-type analytical framework with inexact local proximal solutions and explicit fairness regularization. We call this extended analytical framework DS FedProxGrad (Decay Step Size FedProxGrad). Under a Robbins-Monro step-size schedule \cite{robbins1951stochastic} and a mild decay condition on local inexactness, we prove that $\liminf_{r\to\infty} \mathbb{E}[\|\nabla F(\mathbf{x}^r)\|^2] = 0$, i.e., the algorithm is asymptotically stationary and the convergence rate does not depend on a variance-induced noise floor.
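The difference between a noise floor and asymptotic stationarity shows up even in a one-dimensional toy problem. The sketch below runs noisy gradient descent on F(x) = x^2 / 2 with a constant step and with the step 1/(r + 1), which satisfies the Robbins-Monro conditions (the steps sum to infinity while their squares have a finite sum). This is plain stochastic gradient descent on a toy objective with no proximal step, fairness term, or federation, and every name and constant in it is an assumption.

```python
import random

def run_sgd(step_size, n_iters=5000, noise_std=1.0, seed=0):
    """Noisy gradient descent on F(x) = x^2 / 2.

    Returns the mean squared gradient over the last 1000 iterations.
    """
    rng = random.Random(seed)
    x = 5.0
    tail = []
    for r in range(n_iters):
        noisy_grad = x + rng.gauss(0.0, noise_std)  # true gradient is x
        x -= step_size(r) * noisy_grad
        if r >= n_iters - 1000:
            tail.append(x * x)
    return sum(tail) / len(tail)

constant = run_sgd(lambda r: 0.5)             # stalls at a variance-induced floor
decaying = run_sgd(lambda r: 1.0 / (r + 1))   # Robbins-Monro schedule
print(constant, decaying)
```

With the constant step the squared gradient hovers around a fixed level set by the noise variance, while with the decaying step it keeps shrinking as the iterations proceed.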
[384] Dynamics of Agentic Loops in Large Language Models: A Geometric Theory of Trajectories
Nicolas Tacheny
Main category: cs.LG
TL;DR: A geometric framework analyzes agentic LLM loops as dynamical systems, revealing contractive vs. divergent regimes controlled by prompt design.
Details
Motivation: Agentic systems using LLMs operate through recursive feedback loops, but their geometric behavior (convergence, divergence, complex dynamics) remains poorly understood, creating a need for analytical frameworks.
Method: Introduces a geometric framework treating iterative LLM transformations as discrete dynamical systems, distinguishing artifact space from embedding space, and using isotonic calibration to eliminate cosine similarity bias for accurate trajectory measurement.
Result: Identifies two fundamental regimes: contractive rewriting loops converge to stable attractors with decreasing dispersion, while exploratory summarize-and-negate loops produce unbounded divergence without cluster formation, each with distinct geometric signatures.
Conclusion: Prompt design directly governs the dynamical regime of agentic loops, enabling systematic control over convergence, divergence, and trajectory structure in iterative LLM transformations.
Abstract: Agentic systems built on large language models operate through recursive feedback loops, where each output becomes the next input. Yet the geometric behavior of these agentic loops (whether they converge, diverge, or exhibit more complex dynamics) remains poorly understood. This paper introduces a geometric framework for analyzing agentic trajectories in semantic embedding space, treating iterative transformations as discrete dynamical systems. We distinguish the artifact space, where linguistic transformations occur, from the embedding space, where geometric measurements are performed. Because cosine similarity is biased by embedding anisotropy, we introduce an isotonic calibration that eliminates systematic bias and aligns similarities with human semantic judgments while preserving high local stability. This enables rigorous measurement of trajectories, clusters and attractors. Through controlled experiments on singular agentic loops, we identify two fundamental regimes. A contractive rewriting loop converges toward a stable attractor with decreasing dispersion, while an exploratory summarize and negate loop produces unbounded divergence with no cluster formation. These regimes display qualitatively distinct geometric signatures of contraction and expansion. Our results show that prompt design directly governs the dynamical regime of an agentic loop, enabling systematic control of convergence, divergence and trajectory structure in iterative LLM transformations.
[385] SeVeDo: A Heterogeneous Transformer Accelerator for Low-Bit Inference via Hierarchical Group Quantization and SVD-Guided Mixed Precision
Yuseon Choi, Sangjin Kim, Jungjun Oh, Byeongcheol Kim, Hoi-Jun Yoo
Main category: cs.LG
TL;DR: SeVeDo is an energy-efficient SVD-based accelerator that separates outlier-sensitive components into high-precision paths while using low-bit quantization for remaining computations, achieving up to 13.8TOPS/W efficiency.
Details
Motivation: Low-bit quantization for efficient transformer inference is hampered by activation outliers that degrade accuracy, while existing methods that preserve accuracy incur high energy consumption.
Method: Proposes the SeVeDo accelerator with structural separation: outlier-sensitive components go through a high-precision low-rank path, while the remaining computations use a low-bit residual datapath with group quantization. Includes Hierarchical Group Quantization (HGQ), which combines coarse-grained floating-point scaling with fine-grained shifting, and SVD-guided mixed precision (SVD-MP) for static bitwidth allocation.
Result: Achieves peak energy efficiency of 13.8TOPS/W, with 12.7TOPS/W on ViT-Base and 13.4TOPS/W on Llama2-7B benchmarks, surpassing conventional designs.
Conclusion: SeVeDo provides an energy-efficient solution for low-bit transformer inference by structurally separating precision requirements and optimizing quantization methods, effectively balancing accuracy and energy consumption.
Abstract: Low-bit quantization is a promising technique for efficient transformer inference by reducing computational and memory overhead. However, aggressive bitwidth reduction remains challenging due to activation outliers, leading to accuracy degradation. Existing methods, such as outlier-handling and group quantization, achieve high accuracy but incur substantial energy consumption. To address this, we propose SeVeDo, an energy-efficient SVD-based heterogeneous accelerator that structurally separates outlier-sensitive components into a high-precision low-rank path, while the remaining computations are executed in a low-bit residual datapath with group quantization. To further enhance efficiency, Hierarchical Group Quantization (HGQ) combines coarse-grained floating-point scaling with fine-grained shifting, effectively reducing dequantization cost. Also, SVD-guided mixed precision (SVD-MP) statically allocates higher bitwidths to precision-sensitive components identified through low-rank decomposition, thereby minimizing floating-point operation cost. Experimental results show that SeVeDo achieves a peak energy efficiency of 13.8TOPS/W, surpassing conventional designs, with 12.7TOPS/W on ViT-Base and 13.4TOPS/W on Llama2-7B benchmarks.
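The abstract describes HGQ only as combining coarse-grained floating-point scaling with fine-grained shifting. The following is a minimal software sketch of that idea under stated assumptions: one floating-point scale per coarse group, plus one power-of-two shift per sub-group. The function names, group sizes, and shift-selection rule are hypothetical and do not describe the SeVeDo hardware.

```python
def hgq_quantize(values, sub_size=4, bits=4, max_shift=3):
    """Quantize one coarse group: a shared float scale plus per-sub-group shifts.

    Assumed format: value ~= code * scale / 2**shift, with signed `bits`-bit codes.
    """
    qmax = 2 ** (bits - 1) - 1
    coarse_max = max(abs(v) for v in values)
    scale = coarse_max / qmax if coarse_max > 0 else 1.0  # coarse floating-point scale
    shifts, codes = [], []
    for start in range(0, len(values), sub_size):
        sub = values[start:start + sub_size]
        sub_max = max(abs(v) for v in sub)
        # Pick the largest shift that still represents sub_max without clipping.
        shift = 0
        while shift < max_shift and sub_max <= qmax * scale / 2 ** (shift + 1):
            shift += 1
        step = scale / 2 ** shift
        codes.append([max(-qmax, min(qmax, round(v / step))) for v in sub])
        shifts.append(shift)
    return scale, shifts, codes


def hgq_dequantize(scale, shifts, codes):
    """Reconstruct values; the per-sub-group factor is a power of two (a bit shift)."""
    out = []
    for shift, sub in zip(shifts, codes):
        step = scale / 2 ** shift
        out.extend(code * step for code in sub)
    return out


if __name__ == "__main__":
    x = [0.9, -0.5, 0.3, 0.7, 0.05, -0.02, 0.01, 0.04]
    scale, shifts, codes = hgq_quantize(x)
    print(shifts, codes)
    print(hgq_dequantize(scale, shifts, codes))
```

In this sketch only one floating-point multiplication per coarse group is needed; sub-groups with small magnitudes get a finer step through a shift.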
[386] Constraint Breeds Generalization: Temporal Dynamics as an Inductive Bias
Xia Chen
Main category: cs.LG
TL;DR: The paper argues that physical constraints in biological systems serve as temporal inductive biases that promote generalization, not limitations. Through phase-space analysis, it shows that proper dissipative dynamics compress phase space to abstract invariant features, which can be imposed via input encoding or intrinsic network dynamics.
Details
Motivation: Conventional deep learning focuses on unconstrained optimization, but biological systems operate under strict metabolic constraints. The authors propose that these constraints shape dynamics to function as temporal inductive biases that breed generalization, rather than being limitations.
Method: Phase-space analysis of signal propagation reveals a fundamental asymmetry: expansive dynamics amplify noise, while proper dissipative dynamics compress phase space to align with the network’s spectral bias. This condition can be imposed externally via input encoding or intrinsically through the network’s temporal dynamics. Both require architectures capable of temporal integration and proper constraints.
Result: Comprehensive evaluations across supervised classification, unsupervised reconstruction, and zero-shot reinforcement learning demonstrate that a critical “transition” regime maximizes generalization capability. This establishes dynamical constraints as a distinct class of inductive bias.
Conclusion: Robust AI development requires not only scaling and removing limitations, but computationally mastering the temporal characteristics that naturally promote generalization. Dynamical constraints should be viewed as a distinct class of inductive bias rather than limitations.
Abstract: Conventional deep learning prioritizes unconstrained optimization, yet biological systems operate under strict metabolic constraints. We propose that these physical constraints shape dynamics to function not as limitations, but as a temporal inductive bias that breeds generalization. Through a phase-space analysis of signal propagation, we reveal a fundamental asymmetry: expansive dynamics amplify noise, whereas proper dissipative dynamics compress phase space that aligns with the network’s spectral bias, compelling the abstraction of invariant features. This condition can be imposed externally via input encoding, or intrinsically through the network’s own temporal dynamics. Both pathways require architectures capable of temporal integration and proper constraints to decode induced invariants, whereas static architectures fail to capitalize on temporal structure. Through comprehensive evaluations across supervised classification, unsupervised reconstruction, and zero-shot reinforcement learning, we demonstrate that a critical “transition” regime maximizes generalization capability. These findings establish dynamical constraints as a distinct class of inductive bias, suggesting that robust AI development requires not only scaling and removing limitations, but computationally mastering the temporal characteristics that naturally promote generalization.
[387] Do Sparse Autoencoders Identify Reasoning Features in Language Models?
George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
Main category: cs.LG
TL;DR: Sparse autoencoders (SAEs) fail to identify genuine reasoning features in LLMs due to sparsity bias favoring low-dimensional token-level patterns over high-dimensional reasoning processes.
Details
Motivation: To determine whether sparse autoencoders actually capture genuine reasoning features in large language models, or if they're biased toward superficial linguistic patterns that merely co-occur with reasoning.
Method: 1) Theoretical analysis showing sparsity regularization favors stable low-dimensional correlates; 2) Falsification framework combining causal token injection with LLM-guided counterexample generation; 3) Extensive evaluation across 22 configurations spanning multiple models, layers, and datasets.
Result: 45%-90% of contrastively selected reasoning features activate when only a few associated tokens are injected into non-reasoning text. LLM-guided falsification constructs non-reasoning inputs that trigger feature activation, and paraphrases that suppress it. Steering top features yields no benchmark improvements.
Conclusion: SAEs systematically favor low-dimensional linguistic patterns over genuine high-dimensional reasoning features due to sparsity bias, raising concerns about using SAEs for mechanistic interpretability of reasoning in LLMs.
Abstract: We investigate whether sparse autoencoders identify genuine reasoning features in large language models. We first present a stylized theoretical analysis showing that sparsity-regularized decoding favors stable low-dimensional correlates over high-dimensional within-reasoning variation, biasing learned features toward token-level cues. Motivated by this analysis, we introduce a falsification-based evaluation framework that combines causal token injection with LLM-guided counterexample generation to distinguish genuine reasoning features from superficial linguistic correlates. Across 22 configurations spanning multiple model families, layers and datasets, we find that contrastively selected reasoning features are highly sensitive to token interventions, with 45%-90% activating when only a few associated tokens are injected into non-reasoning text. For the remaining features, LLM-guided falsification reliably constructs non-reasoning inputs that instantiate the feature’s token-level cues and trigger activation, and meaning-preserving paraphrases of top-activating reasoning traces that suppress it. Steering the highest-ranked features yields no improvements on benchmarks. Overall, our results suggest that when low-dimensional token-level patterns are coupled with high-dimensional reasoning processes, the sparsity bias of SAEs systematically favors low-dimensional linguistic patterns that consistently co-occur with reasoning. Code is available at https://github.com/GeorgeMLP/reasoning-probing.
[388] ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking
Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha
Main category: cs.LG
TL;DR: ArenaRL introduces a reinforcement learning paradigm that shifts from pointwise scoring to intra-group relative ranking for open-ended agent tasks, addressing discrimination collapse in reward models.
Details
Motivation: Current RL algorithms struggle with open-ended agent tasks with vast solution spaces (e.g., complex travel planning) due to discrimination collapse in reward models: pointwise scoring compresses subtle advantages into narrow ranges, making reward signals dominated by noise and causing optimization stagnation.
Method: ArenaRL shifts from pointwise scalar scoring to intra-group relative ranking. It introduces: 1) process-aware pairwise evaluation with multi-level rubrics for fine-grained relative scores, 2) intra-group adversarial arena with tournament-based ranking scheme (seeded single-elimination) that achieves O(N) complexity while maintaining accuracy comparable to O(N²) full pairwise comparisons.
Result: Empirical results show the seeded single-elimination scheme achieves nearly equivalent advantage estimation accuracy to full pairwise comparisons with O(N²) complexity, while operating with only O(N) complexity. ArenaRL substantially outperforms standard RL baselines on newly built benchmarks (Open-Travel and Open-DeepResearch).
Conclusion: ArenaRL enables LLM agents to generate more robust solutions for complex real-world tasks by addressing discrimination collapse through relative ranking instead of pointwise scoring, with efficient tournament-based ranking that balances precision and computational efficiency.
Abstract: Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the built seeded single-elimination scheme achieves nearly equivalent advantage estimation accuracy to full pairwise comparisons with O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
[389] Enhancing Large Language Models for Time-Series Forecasting via Vector-Injected In-Context Learning
Jianqi Zhang, Jingyao Wang, Wenwen Qiang, Fanjiang Xu, Changwen Zheng
Main category: cs.LG
TL;DR: LVICL: Vector-injected in-context learning for time series forecasting with frozen LLMs, improving performance without computational overhead.
Details
Motivation: LLMs for time series forecasting face dual challenges: performance degradation due to domain mismatch between pretraining corpora and time series data, and high computational overhead from fine-tuning. A method is needed that improves forecasting performance while keeping LLM parameters frozen.
Method: Proposes LVICL (vector-injected ICL), which uses a learnable context vector adapter to extract compressed example information from multiple time series examples, then injects this vector into every layer of the frozen LLM during the forward pass, eliciting in-context learning ability without increasing prompt length.
Result: Extensive experiments demonstrate the effectiveness of the approach. Vector injection suppresses harmful components and improves forecasting performance compared to conventional ICL, while keeping LLM parameters frozen to reduce computational overhead.
Conclusion: LVICL addresses the dual challenge of prediction performance and compute overhead in LLM4TSF by enabling effective in-context learning through vector injection while keeping LLM parameters frozen, offering a practical solution for web forecasting applications.
Abstract: The World Wide Web needs reliable predictive capabilities to respond to changes in user behavior and usage patterns. Time series forecasting (TSF) is a key means to achieve this goal. In recent years, the large language models (LLMs) for TSF (LLM4TSF) have achieved good performance. However, there is a significant difference between pretraining corpora and time series data, making it hard to guarantee forecasting quality when directly applying LLMs to TSF; fine-tuning LLMs can mitigate this issue, but often incurs substantial computational overhead. Thus, LLM4TSF faces a dual challenge of prediction performance and compute overhead. To address this, we aim to explore a method for improving the forecasting performance of LLM4TSF while freezing all LLM parameters to reduce computational overhead. Inspired by in-context learning (ICL), we propose LVICL. LVICL uses our vector-injected ICL to inject example information into a frozen LLM, eliciting its in-context learning ability and thereby enhancing its performance on the example-related task (i.e., TSF). Specifically, we first use the LLM together with a learnable context vector adapter to extract a context vector from multiple examples adaptively. This vector contains compressed, example-related information. Subsequently, during the forward pass, we inject this vector into every layer of the LLM to improve forecasting performance. Compared with conventional ICL that adds examples into the prompt, our vector-injected ICL does not increase prompt length; moreover, adaptively deriving a context vector from examples suppresses components harmful to forecasting, thereby improving model performance. Extensive experiments demonstrate the effectiveness of our approach.
[390] Contrastive and Multi-Task Learning on Noisy Brain Signals with Nonlinear Dynamical Signatures
Sucheta Ghosh, Zahra Monfared, Felix Dietrich
Main category: cs.LG
TL;DR: Two-stage multitask learning framework for EEG analysis combining denoising, dynamical modeling, and representation learning to improve robustness and decoding performance.
Details
Motivation: To create a more robust and generalizable EEG analysis framework that integrates noise reduction, dynamical feature extraction, and representation learning while avoiding interference between different learning objectives.
Method: Two-stage approach: Stage 1 uses a denoising autoencoder for artifact suppression and signal stabilization. Stage 2 employs a multitask architecture with a convolutional-Transformer backbone for motor imagery classification, chaotic regime discrimination (using Lyapunov exponents), and self-supervised contrastive learning with NT-Xent loss.
Result: The framework enhances robustness, generalization, and surpasses strong baselines and state-of-the-art methods in EEG decoding, demonstrating effectiveness of combining denoising, dynamical features, and self-supervised learning.
Conclusion: The staged multitask learning approach effectively integrates denoising, dynamical modeling, and representation learning for EEG analysis, improving performance and reproducibility while separating noise reduction from higher-level feature learning.
Abstract: We introduce a two-stage multitask learning framework for analyzing Electroencephalography (EEG) signals that integrates denoising, dynamical modeling, and representation learning. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, providing robust signal representations. In the second stage, a multitask architecture processes these denoised signals to achieve three objectives: motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with NT-Xent loss. A convolutional backbone combined with a Transformer encoder captures spatial-temporal structure, while the dynamical task encourages sensitivity to nonlinear brain dynamics. This staged design mitigates interference between reconstruction and discriminative goals, improves stability across datasets, and supports reproducible training by clearly separating noise reduction from higher-level feature learning. Empirical studies show that our framework not only enhances robustness and generalization but also surpasses strong baselines and recent state-of-the-art methods in EEG decoding, highlighting the effectiveness of combining denoising, dynamical features, and self-supervised learning.
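For readers unfamiliar with the NT-Xent loss named in the abstract, here is a minimal pure-Python sketch of its standard form (normalized temperature-scaled cross-entropy over cosine similarities). The function names and the temperature value are illustrative assumptions, and the embeddings are assumed to be non-zero vectors; this is not the authors' training code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss for N positive pairs (views_a[k], views_b[k]).

    Each embedding is contrasted against the other 2N - 1 embeddings in the
    batch; its augmented counterpart is the only positive.
    """
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner
        logits = {k: cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[pos]
    return total / (2 * n)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]
    swapped = [[0.1, 0.9], [0.9, 0.1]]
    # Aligned views should give the lower loss.
    print(nt_xent(anchors, aligned), nt_xent(anchors, swapped))
```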
[391] Your Group-Relative Advantage Is Biased
Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, Yaodong Yang, Jianxin Li, Yikun Ban
Main category: cs.LG
TL;DR: Group-based RL methods like GRPO have biased advantage estimators that underestimate hard prompts and overestimate easy ones, causing exploration/exploitation imbalance. The paper proposes HA-DW, an adaptive reweighting scheme to correct this bias, improving performance on math reasoning benchmarks.
Details
Motivation: Group-based RL methods (GRPO and variants) are widely used for post-training LLMs on reasoning tasks but rely on group-relative advantage estimation whose theoretical properties are poorly understood. The paper identifies a fundamental issue: the advantage estimator is inherently biased relative to the true expected advantage.
Method: Proposes History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. This corrects the systematic bias in group-relative advantage estimation.
Result: Theoretical analysis shows HA-DW addresses the bias issue. Experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants.
Conclusion: Correcting biased advantage estimation is critical for robust and efficient RLVR training. HA-DW provides an effective solution to the fundamental issue in group-based RL methods.
Abstract: Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
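As background for the estimator under discussion, the sketch below computes a GRPO-style group-relative advantage: each response's reward is centered on the group mean and divided by the group standard deviation. The function name, the epsilon term, and the use of the population standard deviation are assumptions (implementations differ); the HA-DW reweighting itself is not shown, since the abstract does not give its formula.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Center rewards on the group mean and scale by the group std.

    rewards: verifier rewards for several sampled responses to ONE prompt.
    The baseline is estimated from the same small group it is applied to,
    which is the kind of finite-sample estimate the paper analyzes.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    hard_prompt = [1.0, 0.0, 0.0, 0.0]  # one success in four samples
    easy_prompt = [1.0, 1.0, 1.0, 0.0]  # three successes in four samples
    solved = [1.0, 1.0, 1.0, 1.0]       # identical rewards give no signal
    for group in (hard_prompt, easy_prompt, solved):
        print(group, [round(a, 3) for a in group_relative_advantages(group)])
```

Note that a group with identical rewards yields all-zero advantages, so such prompts contribute no gradient signal under this estimator.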
[392] PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
Nilin Abrahamsen
Main category: cs.LG
TL;DR: PROMA is a proximal policy optimization method that uses projected microbatch accumulation instead of likelihood ratios, achieving better KL control than GRPO without entropy collapse.
Details
Motivation: To develop a more stable proximal policy optimization method that avoids entropy collapse while providing tighter local KL control than existing methods like GRPO, moving away from reliance on likelihood ratios relative to a reference policy.
Method: PROMA modifies gradient accumulation across microbatches by projecting partially accumulated gradients to be orthogonal to sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass for efficient implementation.
Result: Empirically achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.
Conclusion: PROMA offers an effective alternative to likelihood ratio-based methods for proximal policy optimization, with better stability and control properties.
Abstract: This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy method that modifies gradient accumulation across microbatches rather than relying on likelihood ratios relative to a reference policy. During accumulation, PROMA projects the partially accumulated gradient to be orthogonal to the sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass, enabling efficient implementation. Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.
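The projection step can be pictured with flat gradient vectors: before a new microbatch is added, the running gradient is made orthogonal to the span of that microbatch's per-sequence gradients. The sketch below does this with Gram-Schmidt in pure Python. The function names, the flat (non-layer-wise) treatment, and the choice to add the microbatch gradient after projecting are illustrative assumptions, not the note's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(vector, directions, tol=1e-12):
    """Remove from `vector` its component in the span of `directions`."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:  # Gram-Schmidt: orthogonalize against earlier basis vectors
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip (near-)linearly dependent directions
            basis.append([ri / norm for ri in r])
    out = list(vector)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


def accumulate_with_projection(microbatches):
    """microbatches: list of microbatches, each a list of per-sequence gradients."""
    acc = None
    for seq_grads in microbatches:
        mb_grad = [sum(col) for col in zip(*seq_grads)]
        if acc is None:
            acc = mb_grad
        else:
            acc = project_out(acc, seq_grads)  # orthogonalize the running gradient
            acc = [a + m for a, m in zip(acc, mb_grad)]  # assumed accumulation step
    return acc


if __name__ == "__main__":
    print(project_out([1.0, 1.0, 1.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]))
```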
[393] Fairness-informed Pareto Optimization: An Efficient Bilevel Framework
Sofiane Tanji, Samuel Vaiter, Yassine Laguel
Main category: cs.LG
TL;DR: BADR is a bilevel adaptive rescalarization framework that finds Pareto-efficient models for any fairness metric, addressing limitations of existing fairness methods that often produce Pareto-inefficient solutions.
Details
Motivation: Existing fair ML methods often yield Pareto-inefficient models where some groups' performance could be improved without harming others. Traditional regularization approaches suffer from this, while existing Pareto-efficient methods are biased toward specific fairness perspectives and don't adapt to the wide range of fairness metrics in the literature.
Method: BADR uses a Bilevel Adaptive Rescalarisation procedure with a lower-level weighted empirical risk minimization (weights as a convex combination of groups) and an upper-level optimization of the chosen fairness objective. Includes two novel large-scale, single-loop algorithms, BADR-GD and BADR-SGD, with convergence guarantees.
Result: The framework recovers optimal Pareto-efficient models for any fairness metric. Authors release badr, an open-source Python toolbox for various learning tasks and fairness metrics. Extensive experiments show BADR’s advantages over existing Pareto-efficient fairness approaches.
Conclusion: BADR provides a flexible, scalable framework for achieving Pareto-efficient fairness across diverse metrics, overcoming limitations of previous methods through its bilevel optimization approach and efficient algorithms.
Abstract: Despite their promise, fair machine learning methods often yield Pareto-inefficient models, in which the performance of certain groups can be improved without degrading that of others. This issue arises frequently in traditional in-processing approaches such as fairness-through-regularization. In contrast, existing Pareto-efficient approaches are biased towards a certain perspective on fairness and fail to adapt to the broad range of fairness metrics studied in the literature. In this paper, we present BADR, a simple framework to recover the optimal Pareto-efficient model for any fairness metric. Our framework recovers its models through a Bilevel Adaptive Rescalarisation procedure. The lower level is a weighted empirical risk minimization task where the weights are a convex combination of the groups, while the upper level optimizes the chosen fairness objective. We equip our framework with two novel large-scale, single-loop algorithms, BADR-GD and BADR-SGD, and establish their convergence guarantees. We release badr, an open-source Python toolbox implementing our framework for a variety of learning tasks and fairness metrics. Finally, we conduct extensive numerical experiments demonstrating the advantages of BADR over existing Pareto-efficient approaches to fairness.
[394] Report for NSF Workshop on AI for Electronic Design Automation
Deming Chen, Vijay Ganesh, Weikai Li, Yingyan Celine Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun
Main category: cs.LG
TL;DR: NSF workshop report on AI for EDA, covering four key themes and recommendations for NSF investment in AI/EDA collaboration, infrastructure, and workforce development.
Details
Motivation: To examine how AI technologies (LLMs, GNNs, RL, neurosymbolic methods) can facilitate Electronic Design Automation and shorten design turnaround time by bringing together experts from machine learning and EDA fields.
Method: Workshop organization with four thematic tracks: (1) AI for physical synthesis and DFM, (2) AI for high-level/logic-level synthesis, (3) AI toolbox for optimization and design, and (4) AI for test and verification, followed by expert discussions and recommendations.
Result: Identified key challenges and opportunities in applying AI to EDA across four domains, with specific recommendations for NSF to foster collaboration, invest in foundational AI for EDA, develop data infrastructures, promote scalable compute, and invest in workforce development.
Conclusion: AI has significant potential to transform EDA and enable next-generation hardware systems, requiring strategic NSF investments in collaboration, infrastructure, and workforce development to democratize hardware design.
Abstract: This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI, spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), and neurosymbolic methods, among others, can facilitate EDA and shorten design turnaround. The workshop included four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in the physical manufacturing process and potential AI applications; (2) AI for high-level and logic-level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, etc.; (3) AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM-assisted verification tools, ML-augmented SAT solving, security/reliability challenges, etc. The report recommends that NSF foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next-generation hardware systems. The workshop information can be found on the website https://ai4eda-workshop.github.io/.
[395] Adaptive Exponential Integration for Stable Gaussian Mixture Black-Box Variational Inference
Baojun Che, Yifan Chen, Daniel Zhengyu Huang, Xinying Mao, Weijie Wang
Main category: cs.LG
TL;DR: A stable and efficient black-box variational inference framework using Gaussian mixture families with affine-invariant preconditioning, exponential integrators, and adaptive time stepping.
Details
Motivation: Standard numerical optimization methods for black-box variational inference with Gaussian mixture families often suffer from instability and inefficiency, requiring a more robust framework.
Method: Combines three key components: (1) affine-invariant preconditioning via natural gradient formulations, (2) an exponential integrator that preserves positive definiteness of covariance matrices, and (3) adaptive time stepping for stability and distinct warm-up/convergence phases.
Result: Proves exponential convergence for Gaussian posteriors in noise-free settings and almost-sure convergence under Monte Carlo estimation. Numerical experiments demonstrate effectiveness on multimodal distributions, Neal’s multiscale funnel, and a PDE-based Bayesian inverse problem for Darcy flow.
Conclusion: The proposed framework provides a stable and efficient approach for black-box variational inference with Gaussian mixture families, with theoretical guarantees and practical effectiveness demonstrated across challenging problems.
Abstract: Black-box variational inference (BBVI) with Gaussian mixture families offers a flexible approach for approximating complex posterior distributions without requiring gradients of the target density. However, standard numerical optimization methods often suffer from instability and inefficiency. We develop a stable and efficient framework that combines three key components: (1) affine-invariant preconditioning via natural gradient formulations, (2) an exponential integrator that unconditionally preserves the positive definiteness of covariance matrices, and (3) adaptive time stepping to ensure stability and to accommodate distinct warm-up and convergence phases. The proposed approach has natural connections to manifold optimization and mirror descent. For Gaussian posteriors, we prove exponential convergence in the noise-free setting and almost-sure convergence under Monte Carlo estimation, rigorously justifying the necessity of adaptive time stepping. Numerical experiments on multimodal distributions, Neal’s multiscale funnel, and a PDE-based Bayesian inverse problem for Darcy flow demonstrate the effectiveness of the proposed method.
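To illustrate why an exponential integrator can preserve positivity where a plain explicit step cannot, here is a minimal scalar sketch. It is not the paper's scheme: the toy dynamics d(var)/dt = var * g, the frozen coefficient `g`, and both function names are assumptions chosen only to show the idea in one dimension, where "positive definite" reduces to "positive".

```python
import math


def euler_step(var: float, g: float, h: float) -> float:
    """Explicit Euler step for the toy dynamics d(var)/dt = var * g.

    Hypothetical helper: the result becomes negative whenever h * g < -1.
    """
    return var + h * var * g


def exponential_step(var: float, g: float, h: float) -> float:
    """Exponential update for the same toy dynamics with g held fixed.

    Hypothetical helper: multiplying by exp(h * g) keeps var > 0 for any h.
    """
    return var * math.exp(h * g)


if __name__ == "__main__":
    var, g, h = 1.0, -3.0, 0.5
    print("Euler:      ", euler_step(var, g, h))        # negative "variance"
    print("Exponential:", exponential_step(var, g, h))  # stays positive
```

In this toy setting the exponential update is positive for every step size, which is the scalar analogue of the unconditional positive-definiteness property the abstract describes; the step size still affects accuracy, which is where adaptive time stepping would come in.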
[396] Recommending Best Paper Awards for ML/AI Conferences via the Isotonic Mechanism
Garrett G. Wen, Buxin Su, Natalie Collina, Zhun Deng, Weijie Su
Main category: cs.LG
TL;DR: The paper proposes an author-assisted mechanism that uses the Isotonic Mechanism to elicit truthful author rankings of their own submissions, improving best paper award selection in large ML conferences.
Details
Motivation: Large ML conferences (NeurIPS, ICML) receive tens of thousands of submissions, making best paper award selection challenging and controversial. Current peer review processes struggle with quality and consistency for award selection.
Method: Use Isotonic Mechanism to elicit authors’ rankings of their own submissions, then adjust raw review scores to optimally estimate ground-truth quality. Mechanism extended to handle overlapping authorship. Authors are incentivized to report truthfully under convex additive utility functions.
Result: Authors are incentivized to report truthfully when their utility is a convex additive function of the adjusted scores; this convexity assumption is validated for best paper awards using public ICLR (2019-2023) and NeurIPS (2021-2023) review data. For single-quota authors, truthfulness holds even when utility is merely nondecreasing and additive. Simulations show the mechanism significantly improves the quality of papers selected for awards.
Conclusion: Author-assisted mechanism using Isotonic Mechanism provides practical solution to improve best paper award selection in large ML conferences, with strong theoretical guarantees and empirical validation.
Abstract: Machine learning and artificial intelligence conferences such as NeurIPS and ICML now regularly receive tens of thousands of submissions, posing significant challenges to maintaining the quality and consistency of the peer review process. This challenge is particularly acute for best paper awards, which are an important part of the peer review process, yet whose selection has increasingly become a subject of debate in recent years. In this paper, we introduce an author-assisted mechanism to facilitate the selection of best paper awards. Our method employs the Isotonic Mechanism for eliciting authors’ assessments of their own submissions in the form of a ranking, which is subsequently utilized to adjust the raw review scores for optimal estimation of the submissions’ ground-truth quality. We demonstrate that authors are incentivized to report truthfully when their utility is a convex additive function of the adjusted scores, and we validate this convexity assumption for best paper awards using publicly accessible review data of ICLR from 2019 to 2023 and NeurIPS from 2021 to 2023. Crucially, in the special case where an author has a single quota – that is, may nominate only one paper – we prove that truthfulness holds even when the utility function is merely nondecreasing and additive. This finding represents a substantial relaxation of the assumptions required in prior work. For practical implementation, we extend our mechanism to accommodate the common scenario of overlapping authorship. Finally, simulation results demonstrate that our mechanism significantly improves the quality of papers selected for awards.
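The score-adjustment step can be sketched as an isotonic regression: project the raw review scores onto the set of score vectors that respect the author's claimed ranking. The sketch below uses the pool-adjacent-violators algorithm for a single author with a complete ranking. The function name `isotonic_adjust` and the example scores are illustrative assumptions, and the overlapping-authorship extension mentioned in the abstract is not covered.

```python
from typing import List


def isotonic_adjust(scores: List[float], ranking: List[int]) -> List[float]:
    """Least-squares projection of raw scores onto the author's ranking.

    scores  -- raw review scores, indexed by paper
    ranking -- paper indices ordered from best to worst, as claimed by the author
    Returns adjusted scores that are non-increasing along `ranking`.
    """
    ordered = [scores[i] for i in ranking]

    # Pool adjacent violators: each block stores [sum, count].
    blocks: List[List[float]] = []
    for y in ordered:
        blocks.append([y, 1])
        # Merge while an earlier (better-ranked) block has a lower mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))

    # Map fitted values back to the original paper indexing.
    adjusted = [0.0] * len(scores)
    for position, paper in enumerate(ranking):
        adjusted[paper] = fitted[position]
    return adjusted


if __name__ == "__main__":
    raw = [6.0, 8.0, 5.0]
    # The author claims paper 0 is best, then paper 1, then paper 2.
    print(isotonic_adjust(raw, [0, 1, 2]))
```

When the raw scores already agree with the claimed ranking, the projection leaves them unchanged; when they disagree, the conflicting scores are averaged within a block.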
cs.MA
[397] Delayed Assignments in Online Non-Centroid Clustering with Stochastic Arrivals
Saar Cohen
Main category: cs.MA
TL;DR: A new framework for online non-centroid clustering with delays, where points arrive sequentially and can be assigned to clusters with postponement allowed at a delay cost. The paper presents a constant-competitive algorithm for stochastic arrivals.
Details
Motivation: Traditional online clustering requires immediate assignment decisions upon point arrival, which can lead to poor clustering quality. Allowing delayed assignments with delay costs provides more flexibility to make better clustering decisions, but introduces the challenge of balancing distance costs and delay costs.
Method: The paper introduces a framework for online non-centroid clustering with delays, where points arrive sequentially in a finite metric space. Instead of requiring immediate assignment, algorithms can postpone decisions at a delay cost. The authors focus on stochastic arrivals where points are drawn independently from an unknown fixed distribution, and devise an algorithm that achieves constant competitive ratio.
Result: In the worst-case arbitrary arrival model, no algorithm has a competitive ratio better than sublogarithmic in the number of points. In the stochastic arrival model, however, the authors develop an algorithm that is constant competitive: as the number of points grows, the ratio between the expected overall costs of the output clustering and of an optimal offline clustering is bounded by a constant.
Conclusion: The paper demonstrates that moving beyond worst-case analysis to stochastic arrivals enables constant-competitive online clustering with delays, overcoming the strong impossibility result of the worst-case model. This provides hope for practical online clustering algorithms that balance assignment quality with decision timing.
Abstract: Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in the same cluster are closer to each other than to those in other clusters. In this paper, we present a new framework for studying online non-centroid clustering with delays, where elements, that arrive one at a time as points in a finite metric space, should be assigned to clusters, but assignments need not be immediate. Specifically, upon arrival, each point’s location is revealed, and an online algorithm has to irrevocably assign it to an existing cluster or create a new one containing, at this moment, only this point. However, we allow decisions to be postponed at a delay cost, instead of following the more common assumption of immediate decisions upon arrival. This poses a critical challenge: the goal is to minimize both the total distance costs between points in each cluster and the overall delay costs incurred by postponing assignments. In the classic worst-case arrival model, where points arrive in an arbitrary order, no algorithm has a competitive ratio better than sublogarithmic in the number of points. To overcome this strong impossibility, we focus on a stochastic arrival model, where points’ locations are drawn independently across time from an unknown and fixed probability distribution over the finite metric space. We offer hope for beyond worst-case adversaries: we devise an algorithm that is constant competitive in the sense that, as the number of points grows, the ratio between the expected overall costs of the output clustering and an optimal offline clustering is bounded by a constant.
[398] Average Unfairness in Routing Games
Pan-Yang Su, Arwa Alanqary, Bryce L. Ferguson, Manxi Wu, Alexandre M. Bayen, Shankar Sastry
Main category: cs.MA
TL;DR: The paper introduces average unfairness as a new fairness measure in routing games, compares it with existing measures, and shows it enables better efficiency-fairness tradeoffs in constrained system optimization.
Details
Motivation: Existing unfairness measures in routing games (loaded unfairness and UE unfairness) don't capture average user experience. The authors propose average unfairness as a more comprehensive measure that considers the ratio between average latency and minimum latency, providing a natural complement to existing measures.
Method: Theoretical analysis of three unfairness measures: average unfairness (new), loaded unfairness, and UE unfairness. They characterize worst-case values, establish relationships between measures, and study the constrained system optimum (CSO) problem with unfairness constraints. Use both analytical proofs and numerical examples.
Result: 1) Worst-case values of all three unfairness measures coincide and are characterized by latency function steepness. 2) Average unfairness ≤ loaded unfairness, with equality only when flow is fully fair. 3) For the same tolerance level, CSO with average unfairness constraint achieves lower total latency than with loaded unfairness constraint. 4) Improvement is strict in parallel-link networks and has sufficient conditions for general networks.
Conclusion: Average unfairness provides a valuable new fairness measure that enables better efficiency-fairness tradeoffs in network routing. It offers theoretical guarantees for evaluating these tradeoffs and shows practical advantages over existing measures in constrained optimization problems.
Abstract: We propose average unfairness as a new measure of fairness in routing games, defined as the ratio between the average latency and the minimum latency experienced by users. This measure is a natural complement to two existing unfairness notions: loaded unfairness, which compares maximum and minimum latencies of routes with positive flow, and user equilibrium (UE) unfairness, which compares maximum latency with the latency of a Nash equilibrium. We show that the worst-case values of all three unfairness measures coincide and are characterized by a steepness parameter intrinsic to the latency function class. We show that average unfairness is always no greater than loaded unfairness, and the two measures are equal only when the flow is fully fair. Besides that, we offer a complete comparison of the three unfairness measures, which, to the best of our knowledge, is the first theoretical analysis in this direction. Finally, we study the constrained system optimum (CSO) problem, where one seeks to minimize total latency subject to an upper bound on unfairness. We prove that, for the same tolerance level, the optimal flow under an average unfairness constraint achieves lower total latency than any flow satisfying a loaded unfairness constraint. We show that such improvement is always strict in parallel-link networks and establish sufficient conditions for general networks. We further illustrate the latter with numerical examples. Our results provide theoretical guarantees and valuable insights for evaluating fairness-efficiency tradeoffs in network routing.
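The two flow-dependent measures defined above can be computed directly from route flows and latencies. The sketch below assumes that the average is flow-weighted and that the minimum is taken over routes carrying positive flow; both are readings of "experienced by users", not details confirmed by the abstract. The function name and example numbers are illustrative.

```python
from typing import List, Tuple


def unfairness_measures(flows: List[float],
                        latencies: List[float]) -> Tuple[float, float]:
    """Return (average unfairness, loaded unfairness) for one flow pattern.

    flows     -- flow on each route
    latencies -- latency of each route under that flow
    Only routes with positive flow are considered (assumption).
    """
    used = [(f, l) for f, l in zip(flows, latencies) if f > 0]
    total_flow = sum(f for f, _ in used)
    average_latency = sum(f * l for f, l in used) / total_flow
    min_latency = min(l for _, l in used)
    max_latency = max(l for _, l in used)
    return average_latency / min_latency, max_latency / min_latency


if __name__ == "__main__":
    # Two parallel links sharing the flow equally, with latencies 1 and 2.
    average_u, loaded_u = unfairness_measures([0.5, 0.5], [1.0, 2.0])
    print(average_u, loaded_u)
```

In this example the average measure is smaller than the loaded one, consistent with the inequality stated in the abstract; the two coincide only when all used routes have equal latency.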
[399] DISPATCH – Decentralized Informed Spatial Planning and Assignment of Tasks for Cooperative Heterogeneous Agents
Yao Liu, Sampad Mohanty, Elizabeth Ondula, Bhaskar Krishnamachari
Main category: cs.MA
TL;DR: The paper proposes two decentralized algorithms (EG-MARL and an online mechanism) that balance fairness and efficiency in spatial task allocation for multi-agent systems under partial observability, connecting Eisenberg-Gale equilibrium with multi-agent learning.
Details
Motivation: Greedy assignment policies in multi-agent systems (like delivery robots or ride-sharing) maximize efficiency but create unfairness - some tasks get favorable service while others face long waits. Existing approaches either assume centralized coordination or ignore fairness under partial observability.
Method: 1) EG-MARL: Multi-agent reinforcement learning framework guided by centralized EG equilibrium assignment algorithm. 2) Stochastic online optimization mechanism performing guided exploration and subset-based fair assignment as tasks are discovered. Both connect Eisenberg-Gale equilibrium with decentralized, partially observable multi-agent learning.
Result: Both methods preserve fairness-efficiency balance of EG solution under partial observability. EG-MARL achieves near-centralized coordination and reduced travel distances. Online mechanism enables real-time allocation with competitive fairness. Evaluated on Multi-Agent Particle Environment simulations and Webots-based warehouse proof-of-concept.
Conclusion: The paper successfully bridges Eisenberg-Gale equilibrium theory with decentralized multi-agent learning, providing practical algorithms that maintain fairness-efficiency tradeoffs in partially observable spatial task allocation systems.
Abstract: Spatial task allocation in systems such as multi-robot delivery or ride-sharing requires balancing efficiency with fair service across tasks. Greedy assignment policies that match each agent to its highest-preference or lowest-cost task can maximize efficiency but often create inequities: some tasks receive disproportionately favorable service (e.g., shorter delays or better matches), while others face long waits or poor allocations. We study fairness in heterogeneous multi-agent systems where tasks vary in preference alignment and urgency. Most existing approaches either assume centralized coordination or largely ignore fairness under partial observability. Distinct from this prior work, we establish a connection between the Eisenberg-Gale (EG) equilibrium convex program and decentralized, partially observable multi-agent learning. Building on this connection, we develop two equilibrium-informed algorithms that integrate fairness and efficiency: (i) a multi-agent reinforcement learning (MARL) framework, EG-MARL, whose training is guided by a centralized EG equilibrium assignment algorithm; and (ii) a stochastic online optimization mechanism that performs guided exploration and subset-based fair assignment as tasks are discovered. We evaluate on Multi-Agent Particle Environment (MPE) simulations across varying team sizes against centralized EG, Hungarian, and Min-Max distance baselines, and also present a Webots-based warehouse proof-of-concept with heterogeneous robots. Both methods preserve the fairness-efficiency balance of the EG solution under partial observability, with EG-MARL achieving near-centralized coordination and reduced travel distances, and the online mechanism enabling real-time allocation with competitive fairness.
[400] VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Yaqi Wang, Zhenxin Zhao
Main category: cs.MA
TL;DR: VLM-CAD: A vision language model-based collaborative agent workflow for analog circuit sizing that combines schematic analysis with explainable optimization, achieving efficient sizing with physics-based explainability.
Details
Motivation: Existing analog circuit sizing methods rely only on netlists, ignoring schematic information, and lack explainability needed for industrial sign-off. Black-box ML methods and LLM hallucination risks fail to provide ground-truth explainability.
Method: Proposes VLM-CAD workflow with Image2Net for schematic annotation and structured JSON generation for VLMs. Uses collaborative agents for circuit analysis, DC optimization, inference-based sizing, and external optimization. Introduces ExTuRBO for explainable Bayesian optimization with warm-start seeds and dual-granularity sensitivity analysis.
Result: Demonstrated on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models. VLM-CAD effectively balances power and performance while maintaining physics-based explainability. It meets all specifications with low power consumption when optimizing an amplifier with a complementary input and a class-AB output stage; total runtime across all experiments on two amplifiers is under 66 minutes.
Conclusion: VLM-CAD addresses the explainability gap in analog circuit sizing by integrating schematic analysis with vision language models and explainable optimization methods, providing a practical solution for industrial sign-off with cognitive links between schematics and performance.
Abstract: Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.
cs.MM
[401] Structured Image-based Coding for Efficient Gaussian Splatting Compression
Pedro Martin, Antonio Rodrigues, Joao Ascenso, Maria Paula Queluz
Main category: cs.MM
TL;DR: GSICO is a novel compression method for Gaussian Splatting models that arranges GS parameters into structured images for efficient encoding using conventional image codecs, achieving 20.2x compression with minimal quality loss.
Details
Motivation: Gaussian Splatting models require storing millions of parameters, leading to large file sizes that impair practical use in multimedia systems, creating a need for efficient compression while preserving visual fidelity.
Method: GSICO uses a mapping procedure to arrange GS parameters into structured images guided by a novel algorithm that enhances spatial coherence, then encodes these parameter images using conventional image codecs.
Result: Experimental evaluations on Tanks and Temples, Deep Blending, and Mip-NeRF360 datasets show GSICO achieves average compression factors of 20.2x with minimal loss in visual quality (PSNR, SSIM, LPIPS) and superior rate-distortion trade-offs compared to state-of-the-art methods.
Conclusion: GSICO provides an effective solution for compressing Gaussian Splatting models, enabling practical deployment in multimedia systems by significantly reducing file sizes while maintaining perceptual fidelity.
Abstract: Gaussian Splatting (GS) has recently emerged as a state-of-the-art representation for radiance fields, combining real-time rendering with high visual fidelity. However, GS models require storing millions of parameters, leading to large file sizes that impair their use in practical multimedia systems. To address this limitation, this paper introduces GS Image-based Compression (GSICO), a novel GS codec that efficiently compresses pre-trained GS models while preserving perceptual fidelity. The core contribution lies in a mapping procedure that arranges GS parameters into structured images, guided by a novel algorithm that enhances spatial coherence. These GS parameter images are then encoded using a conventional image codec. Experimental evaluations on Tanks and Temples, Deep Blending, and Mip-NeRF360 datasets show that GSICO achieves average compression factors of 20.2x with minimal loss in visual quality, as measured by PSNR, SSIM, and LPIPS. Compared with state-of-the-art GS compression methods, the proposed codec consistently yields superior rate-distortion (RD) trade-offs.
eess.AS
[402] DynamicSound simulator for simulating moving sources and microphone arrays
Luca Barbisan, Marco Levorato, Fabrizio Riente
Main category: eess.AS
TL;DR: DynamicSound is an open-source acoustic simulation framework for generating realistic multichannel audio with moving sound sources and microphone arrays in 3D space, supporting Doppler effects, propagation delays, and environmental acoustics.
Details
Motivation: Most existing acoustic simulators are tailored to indoor environments and limited to static sound sources, making them unsuitable for scenarios involving moving sources, moving microphones, or long-distance propagation needed for modern spatial audio and sound-source localization algorithms.
Method: Developed an open-source framework that simulates multichannel audio from moving sound sources in 3D space, accounting for finite sound propagation delays, Doppler effects, distance-dependent attenuation, air absorption, and first-order reflections from planar surfaces.
Result: The framework generates temporally consistent spatial audio signals with high spatial fidelity across varying source positions and acoustic conditions, accurately reproducing inter-microphone time delays, level differences, and spectral coloration.
Conclusion: DynamicSound provides a flexible, reproducible tool for developing, training, and evaluating modern spatial audio and sound-source localization algorithms by enabling generation of realistic multichannel audio under controlled conditions.
Abstract: Developing algorithms for sound classification, detection, and localization requires large amounts of flexible and realistic audio data, especially when leveraging modern machine learning and beamforming techniques. However, most existing acoustic simulators are tailored for indoor environments and are limited to static sound sources, making them unsuitable for scenarios involving moving sources, moving microphones, or long-distance propagation. This paper presents DynamicSound, an open-source acoustic simulation framework for generating multichannel audio from one or more sound sources that can move continuously in three-dimensional space and are recorded by arbitrarily configured microphone arrays. The proposed model explicitly accounts for finite sound propagation delays, Doppler effects, distance-dependent attenuation, air absorption, and first-order reflections from planar surfaces, yielding temporally consistent spatial audio signals. Unlike conventional mono or stereo simulators, the proposed system synthesizes audio for an arbitrary number of virtual microphones, accurately reproducing inter-microphone time delays, level differences, and spectral coloration induced by the environment. Comparative evaluations with existing open-source tools demonstrate that the generated signals preserve high spatial fidelity across varying source positions and acoustic conditions. By enabling the generation of realistic multichannel audio under controlled and repeatable conditions, the proposed open framework provides a flexible and reproducible tool for the development, training, and evaluation of modern spatial audio and sound-source localization algorithms.
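Three of the physical effects listed above (propagation delay, distance-dependent attenuation, and Doppler shift) follow from textbook free-field formulas. The sketch below evaluates them for one source and one static microphone at a single instant. It is not DynamicSound's API: the function name, the 1/r spherical-spreading gain, and the 343 m/s speed of sound are assumptions, and air absorption and reflections are omitted.

```python
import math
from typing import Sequence, Tuple

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at about 20 °C


def propagation_snapshot(source_pos: Sequence[float],
                         source_vel: Sequence[float],
                         mic_pos: Sequence[float],
                         source_freq: float) -> Tuple[float, float, float]:
    """Return (delay in s, amplitude gain, observed frequency in Hz).

    Assumes a static microphone, a still medium, and a source that is not
    located exactly at the microphone position.
    """
    offset = [m - s for s, m in zip(source_pos, mic_pos)]
    distance = math.sqrt(sum(d * d for d in offset))

    delay = distance / SPEED_OF_SOUND   # finite propagation delay
    gain = 1.0 / distance               # spherical spreading (assumed model)

    # Component of the source velocity pointing toward the microphone.
    v_toward = sum(v * d for v, d in zip(source_vel, offset)) / distance
    observed_freq = source_freq * SPEED_OF_SOUND / (SPEED_OF_SOUND - v_toward)
    return delay, gain, observed_freq


if __name__ == "__main__":
    # Source 343 m away, approaching the microphone at 10% of the speed of sound.
    print(propagation_snapshot((0.0, 0.0, 0.0), (34.3, 0.0, 0.0),
                               (343.0, 0.0, 0.0), 1000.0))
```

A time-domain simulator would apply these quantities sample by sample along the source trajectory (for example through a time-varying fractional delay) rather than at one instant as done here.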
[403] Attentive AV-FusionNet: Audio-Visual Quality Prediction with Hybrid Attention
Ina Salaj, Arijit Biswas
Main category: eess.AS
TL;DR: Novel deep learning AVQ model combines learned audio features with hand-crafted video features using attention mechanisms, outperforming prior fusion approaches.
Details
Motivation: Prior audio-visual quality prediction models use simple fusion strategies that don't effectively capture cross-modal interactions and intra-modal relationships needed for accurate quality assessment.
Method: Hybrid representation combining learned GML audio features with hand-crafted VMAF video features, using attention mechanisms for cross-modal interactions and intra-modal relationships, plus modality relevance estimator.
Result: Improved AVQ prediction accuracy and robustness across diverse content types, with modality relevance estimator enabling potential adaptive bitrate allocation.
Conclusion: The proposed hybrid approach with attention mechanisms effectively captures complex audio-visual quality relationships, offering superior performance and practical applications for adaptive streaming.
Abstract: We introduce a novel deep learning-based audio-visual quality (AVQ) prediction model that leverages internal features from state-of-the-art unimodal predictors. Unlike prior approaches that rely on simple fusion strategies, our model employs a hybrid representation that combines learned Generative Machine Listener (GML) audio features with hand-crafted Video Multimethod Assessment Fusion (VMAF) video features. Attention mechanisms capture cross-modal interactions and intra-modal relationships, yielding context-aware quality representations. A modality relevance estimator quantifies each modality’s contribution per content, potentially enabling adaptive bitrate allocation. Experiments demonstrate improved AVQ prediction accuracy and robustness across diverse content types.
[404] Distributed Multichannel Active Noise Control with Asynchronous Communication
Junwei Ji, Dongyuan Shi, Boxiang Wang, Ziyi Yang, Haowen Li, Woon-Seng Gan
Main category: eess.AS
TL;DR: Proposes asynchronous communication for distributed multichannel active noise control to reduce communication overhead while maintaining noise reduction performance.
Details
Motivation: Conventional DMCANC methods require synchronous communication and frequent data exchange, leading to high communication overhead. Need more efficient and adaptable approach for heterogeneous networks.
Method: Asynchronous communication strategy where nodes use weight-constrained filtered-x LMS algorithm and independently request communication only when local noise reduction degrades. Nodes transmit weight differences to update control filters and center points.
Result: Simulation shows proposed ACDMCANC maintains effective noise reduction with significantly reduced communication load, offering improved scalability for heterogeneous networks.
Conclusion: Asynchronous communication enables efficient distributed noise control with reduced communication overhead while preserving cooperative behavior between nodes.
Abstract: Distributed multichannel active noise control (DMCANC) offers effective noise reduction across large spatial areas by distributing the computational load of centralized control to multiple low-cost nodes. Conventional DMCANC methods, however, typically assume synchronous communication and require frequent data exchange, resulting in high communication overhead. To enhance efficiency and adaptability, this work proposes an asynchronous communication strategy where each node executes a weight-constrained filtered-x LMS (WCFxLMS) algorithm and independently requests communication only when its local noise reduction performance degrades. Upon request, other nodes transmit the weight difference between their local control filter and the center point in WCFxLMS, which are then integrated to update both the control filter and the center point. This design enables nodes to operate asynchronously while preserving cooperative behavior. Simulation results demonstrate that the proposed asynchronous communication DMCANC (ACDMCANC) system maintains effective noise reduction with significantly reduced communication load, offering improved scalability for heterogeneous networks.
[405] A Stabilized Hybrid Active Noise Control Algorithm of GFANC and FxNLMS with Online Clustering
Zhengding Luo, Haozhe Ma, Boxiang Wang, Ziyi Yang, Dongyuan Shi, Woon-Seng Gan
Main category: eess.AS
TL;DR: Hybrid GFANC-FxNLMS algorithm combines fast response of GFANC with low steady-state error of FxNLMS, using online clustering to prevent instability from frequent re-initializations.
Details
Motivation: FxNLMS has slow convergence and divergence risk despite low steady-state error, while GFANC offers fast response but lacks adaptability leading to large steady-state errors. Need to combine advantages of both approaches.
Method: Proposes hybrid GFANC-FxNLMS algorithm where GFANC provides frame-level control filter as initialization for FxNLMS, and FxNLMS performs continuous adaptation at sampling rate. Introduces online clustering module to avoid unnecessary re-initializations that could destabilize the system.
Result: Simulation results show the proposed algorithm achieves fast response, very low steady-state error, and high stability, requiring only one pre-trained broadband filter.
Conclusion: The hybrid GFANC-FxNLMS algorithm successfully combines complementary advantages of both approaches, with online clustering ensuring system stability while maintaining performance benefits.
Abstract: The Filtered-x Normalized Least Mean Square (FxNLMS) algorithm suffers from slow convergence and a risk of divergence, although it can achieve low steady-state errors after sufficient adaptation. In contrast, the Generative Fixed-Filter Active Noise Control (GFANC) method offers fast response speed, but its lack of adaptability may lead to large steady-state errors. This paper proposes a hybrid GFANC-FxNLMS algorithm to leverage the complementary advantages of both approaches. In the hybrid GFANC-FxNLMS algorithm, GFANC provides a frame-level control filter as an initialization for FxNLMS, while FxNLMS performs continuous adaptation at the sampling rate. Small variations in the GFANC-generated filter may repeatedly reinitialize FxNLMS, interrupting its adaptation process and destabilizing the system. An online clustering module is introduced to avoid unnecessary re-initializations and improve system stability. Simulation results show that the proposed algorithm achieves fast response, very low steady-state error, and high stability, requiring only one pre-trained broadband filter.
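For readers unfamiliar with the adaptive half of the hybrid, the following is a minimal single-channel FxNLMS simulation. It does not implement GFANC or the online clustering module: the `w_init` argument only marks where a GFANC-generated filter would be supplied as the initialization. The path coefficients, step size, and function names are hypothetical, and the secondary-path estimate is assumed to be perfect.

```python
import random
from typing import List, Optional, Tuple


def fir(coeffs: List[float], history: List[float]) -> float:
    """FIR filter output; history[0] is the newest sample."""
    return sum(c * h for c, h in zip(coeffs, history))


def run_fxnlms(n_samples: int = 4000, mu: float = 0.2, eps: float = 1e-8,
               seed: int = 0,
               w_init: Optional[List[float]] = None) -> Tuple[List[float], List[float]]:
    """Simulate FxNLMS on white noise; return (final weights, error signal)."""
    rng = random.Random(seed)
    primary = [0.0, 0.0, 0.8, 0.4]   # hypothetical primary path
    secondary = [0.0, 1.0]           # hypothetical secondary path (1-sample delay)
    taps = 4
    w = list(w_init) if w_init is not None else [0.0] * taps

    x_hist = [0.0] * taps             # reference signal history
    y_hist = [0.0] * len(secondary)   # control output history
    fx_hist = [0.0] * taps            # filtered-reference history
    errors: List[float] = []

    for _ in range(n_samples):
        x_hist = [rng.gauss(0.0, 1.0)] + x_hist[:-1]
        d = fir(primary, x_hist)                 # disturbance at the error mic
        y_hist = [fir(w, x_hist)] + y_hist[:-1]  # anti-noise from control filter
        e = d - fir(secondary, y_hist)           # residual error
        fx_hist = [fir(secondary, x_hist)] + fx_hist[:-1]

        # Normalized update: step scaled by filtered-reference power.
        power = sum(v * v for v in fx_hist) + eps
        w = [wi + mu * e * fi / power for wi, fi in zip(w, fx_hist)]
        errors.append(e)
    return w, errors


if __name__ == "__main__":
    weights, err = run_fxnlms()
    tail_power = sum(v * v for v in err[-100:]) / 100
    print(weights, tail_power)
```

Starting from zero weights, the filter needs many samples to converge, which is the slow-convergence behaviour the abstract attributes to FxNLMS; passing a good `w_init` shortens that transient, which is the role the hybrid assigns to GFANC.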
[406] Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs
Lalaram Arya, Mrinmoy Bhattacharjee, Adarsh C. R., S. R. Mahadeva Prasanna
Main category: eess.AS
TL;DR: DS2ST-LM: A scalable direct speech-to-speech translation framework using multilingual LLMs that outperforms cascaded systems while preserving speaker identity through timbre-controlled synthesis.
Details
Motivation: Direct S2ST systems face challenges with semantic-acoustic alignment instability when parallel speech data is scarce, difficulty preserving speaker identity, and limited multilingual scalability. The authors aim to address these limitations.
Method: Propose DS2ST-LM: a single-stage framework with Whisper speech encoder, learnable projection module, Qwen2-0.5B LLM, and timbre-controlled vocoder. Create GigaS2S-1000 corpus (1000-hour bilingual) with synthetic target speech. Compare semantic token strategies (S3 vs text-derived) and projection architectures (Linear, Conv1D-Linear, Q-Former).
Result: Outperforms cascaded and ST+TTS baselines on lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics across multiple language pairs (French, Spanish, German, Hindi, Bengali, Urdu). Achieves better speaker similarity and perceptual naturalness than prior direct S2ST systems. Simple Linear projector performs best despite higher-capacity alternatives converging faster.
Conclusion: DS2ST-LM demonstrates effective direct speech-to-speech translation using LLMs, showing that synthetic data alleviates data scarcity, simple projection works best, and timbre-aware synthesis successfully preserves speaker identity while achieving multilingual scalability.
Abstract: Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another, while reducing error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to face notable challenges, including instability in semantic-acoustic alignment when parallel speech data is scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework leveraging a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus by extending the GigaST dataset with high-fidelity synthetic target speech, and show that this synthetic data alleviates data scarcity to some extent. We investigate two semantic token generation strategies: speech-derived S3 tokens and text-derived tokens generated by a pre-trained LLM, and analyze their impact on training stability and semantic consistency. We further evaluate three projection architectures (Linear, Conv1D-Linear, and Q-Former) and observe that while higher-capacity projectors converge faster, the simple Linear projector achieves higher performance. Extensive experiments demonstrate that DS2ST-LM outperforms traditional cascaded and ST (Qwen-Audio) + TTS baselines across both lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics, while extending to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu. Furthermore, we incorporate timbre-aware speech synthesis to preserve speaker information, enabling DS2ST-LM to surpass prior direct S2ST systems in both speaker similarity and perceptual naturalness.
[407] Loose coupling of spectral and spatial models for multi-channel diarization and enhancement of meetings in dynamic environments
Adrian Meise, Tobias Cord-Landwehr, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
Main category: eess.AS
TL;DR: Proposes a joint spatial-spectral mixture model for speaker diarization that handles moving speakers by loosely coupling speaker and position information probabilistically.
Details
Motivation: Traditional microphone array systems assume fixed speaker positions, but speakers move in real meetings. There's no one-to-one mapping between positions and speakers when speakers move, creating a challenge for diarization and signal enhancement.
Method: A novel joint spatial and spectral mixture model with two loosely coupled submodels. The relationship between speaker identity and position index is modeled probabilistically, allowing spatial and spectral information to be jointly exploited while accommodating speakers speaking from different positions.
Result: Experiments on LibriCSS dataset with simulated speaker position changes show great improvements over tightly coupled subsystems.
Conclusion: The proposed loosely coupled approach effectively handles moving speakers in meeting transcription by probabilistically modeling speaker-position relationships, outperforming traditional tightly coupled systems.
Abstract: Sound capture by microphone arrays opens the possibility to exploit spatial, in addition to spectral, information for diarization and signal enhancement, two important tasks in meeting transcription. However, there is no one-to-one mapping of positions in space to speakers if speakers move. Here, we address this by proposing a novel joint spatial and spectral mixture model, whose two submodels are loosely coupled by modeling the relationship between speaker and position index probabilistically. Thus, spatial and spectral information can be jointly exploited, while at the same time allowing for speakers speaking from different positions. Experiments on the LibriCSS data set with simulated speaker position changes show great improvements over tightly coupled subsystems.
[408] MOSA: Mixture of Simple Adapters Outperforms Monolithic Approaches in LLM-based Multilingual ASR
Junjie Li, Jing Peng, Yangui Fang, Shuai Wang, Kai Yu
Main category: eess.AS
TL;DR: MOSA uses Mixture of Simple Adapters for multilingual LLM-based ASR, achieving better performance with fewer parameters by specializing experts for language-shared/specific knowledge.
Details
Motivation: Single projectors struggle to align representations across different languages in LLM-based ASR, and scaling data/model parameters isn't efficient for multilingual alignment.
Method: Proposes MOSA (Mixture of Simple Adapters), which aggregates multiple simple adapters so that different experts specialize in learning language-shared or language-specific knowledge.
Result: MOSA-Base achieves 15.4% relative WER reduction vs Ideal-LLM Base across all languages; achieves 13.3% WER reduction with only 60% of parameters.
Conclusion: Mixture of simple adapters is more suitable for multilingual LLM-based ASR than complex single-adapter designs, offering superior parameter efficiency and robustness against data imbalance.
Abstract: LLM-based ASR overcomes multilingual data scarcity by projecting speech representations into the LLM space to leverage its robust semantic and reasoning capabilities. However, while previous approaches typically enhance performance by scaling data or model parameters, a single projector often struggles to effectively align representations across different languages. In this work, we propose an MoE-based projector named MOSA (Mixture of Simple Adapters). By aggregating multiple simple adapters, this architecture enables different experts to specialize in learning either language-shared or language-specific knowledge. This approach not only mitigates parameter interference between languages but also facilitates positive transfer from high-resource to low-resource languages, effectively alleviating data scarcity issues. Experimental results demonstrate that MOSA-Base achieves a 15.4% relative reduction in average WER compared to the Ideal-LLM Base, consistently outperforming it across all languages. Notably, MOSA achieves a 13.3% WER reduction over the Ideal-LLM Base while utilizing only 60% of its parameters. These findings highlight MOSA’s superior parameter efficiency and robustness against data imbalance, suggesting that a mixture of simple adapters is more suitable for multilingual LLM-based ASR than complex single-adapter designs.
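To make the "mixture of simple adapters" idea concrete, here is a minimal sketch of an MoE-style projector: several linear adapters whose outputs are blended per frame by a softmax gate. The class name, dense (non-sparse) routing, and initialization are assumptions for illustration; the abstract does not specify the router or adapter design used in MOSA.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class MixtureOfAdapters:
    """Hypothetical projector: K linear adapters mixed by a learned gate."""

    def __init__(self, d_in, d_out, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        # One simple (linear) adapter per expert.
        self.experts = rng.normal(0.0, 0.02, (n_experts, d_in, d_out))
        # Gate that scores each expert for every input frame.
        self.gate = rng.normal(0.0, 0.02, (d_in, n_experts))

    def __call__(self, x):
        """x: (T, d_in) speech features -> (T, d_out) projected features."""
        weights = softmax(x @ self.gate)                     # (T, K)
        outs = np.einsum("td,kdo->tko", x, self.experts)     # (T, K, d_out)
        return np.einsum("tk,tko->to", weights, outs), weights
```

With this kind of gating, some experts can end up receiving weight from all languages (shared knowledge) while others are used mostly by one language (specific knowledge), which is the specialization the abstract describes.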
[409] Towards Evaluating Generative Audio: Insights from Neural Audio Codec Embedding Distances
Arijit Biswas, Lars Villemoes
Main category: eess.AS
TL;DR: Enhanced neural audio codec (DACe) improves audio compression and provides better features for perceptual quality evaluation, with FAD outperforming MMD in correlation with human judgments.
Details
Motivation: Neural audio codecs can serve dual purposes: compression and perceptual quality evaluation. The paper aims to enhance codec fidelity and systematically compare different embedding methods for audio quality assessment.
Method: Developed DACe (enhanced Descript Audio Codec) trained on diverse tonal data with balanced sampling. Compared Fréchet Audio Distance (FAD) and Maximum Mean Discrepancy (MMD) on MUSHRA tests across speech, music, and mixed content. Evaluated embeddings from various NACs and other models like CLAP and OpenL3.
Result: FAD consistently outperformed MMD in correlation with human judgments. Higher-fidelity NACs (like DACe) showed stronger correlations. While CLAP-M and OpenL3-128M achieved higher correlations, NAC embeddings provide practical zero-shot audio quality assessment using only unencoded audio for training.
Conclusion: Neural audio codecs demonstrate dual utility for both compression and perceptually informed audio evaluation, with enhanced codecs like DACe providing better features for quality assessment while maintaining practical advantages for zero-shot applications.
Abstract: Neural audio codecs (NACs) achieve low-bitrate compression by learning compact audio representations, which can also serve as features for perceptual quality evaluation. We introduce DACe, an enhanced, higher-fidelity version of the Descript Audio Codec (DAC), trained on diverse real and synthetic tonal data with balanced sampling. We systematically compare Fréchet Audio Distance (FAD) and Maximum Mean Discrepancy (MMD) on MUSHRA tests across speech, music, and mixed content. FAD consistently outperforms MMD, and embeddings from higher-fidelity NACs (such as DACe) show stronger correlations with human judgments. While CLAP LAION Music (CLAP-M) and OpenL3 Mel128 (OpenL3-128M) embeddings achieve higher correlations, NAC embeddings provide a practical zero-shot approach to audio quality assessment, requiring only unencoded audio for training. These results demonstrate the dual utility of NACs for compression and perceptually informed audio evaluation.
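For readers unfamiliar with FAD, the sketch below computes the Fréchet distance between Gaussians fitted to two sets of embeddings, which is the quantity FAD is built on. The function name and the eigenvalue-based trace computation are choices made here for a NumPy-only example; the embedding extractor (for example a codec encoder) is assumed to exist elsewhere.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two (N, D) embedding sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical embedding sets give a distance of approximately zero, and shifting one set by a constant vector increases the distance by the squared norm of that shift.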
[410] AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning
Haoyu Zhang, Jiaxian Guo, Yusuke Iwasawa, Yutaka Matsuo
Main category: eess.AS
TL;DR: AQA-TTRL enables Large Audio Language Models to improve on-the-fly using only unlabeled test data through pseudo-label generation and reinforcement learning with noise handling techniques.
Details
Motivation: Current Large Audio Language Models are static after deployment and cannot improve with new real-world audio data, while supervised fine-tuning is too costly.
Method: A novel test-time adaptation framework that: 1) generates pseudo-labels via majority voting from predictions, 2) optimizes via reinforcement learning, 3) uses confidence-based weighting to handle noisy labels, and 4) employs multiple-attempt sampling to prevent advantage collapse.
Result: Achieves significant improvements: 4.42% average improvement for Qwen2.5-Omni 7B model and 11.04% for 3B model on MMAU, MMAR, and MMSU benchmarks. The adapted 3B model consistently outperforms the unadapted 7B model.
Conclusion: Test-time adaptation is effective for audio understanding, enabling smaller models to outperform larger unadapted models through on-the-fly learning from unlabeled test data.
Abstract: Large Audio Language Models (LALMs) demonstrate impressive general audio understanding, but once deployed, they are static and fail to improve with new real-world audio data. As traditional supervised fine-tuning is costly, we introduce a novel framework for test-time audio understanding, AQA-TTRL, where an LALM evolves on-the-fly using only unlabeled test data. It first generates pseudo-labels from the prediction via majority voting, then optimizes the model via reinforcement learning. To handle the inherent noise in these self-generated labels, we introduce a confidence-based weighting method to adjust training signals. Furthermore, a multiple-attempt sampling operation mitigates advantage collapse and stabilizes training. On the MMAU (test-mini/test), MMAR, and MMSU benchmarks, AQA-TTRL achieves significant average improvements of 4.42% for the Qwen2.5-Omni 7B model and 11.04% for the 3B model. Notably, the adapted 3B model consistently outperforms the direct inference of the unadapted 7B model, highlighting the effectiveness of previously unexplored test-time adaptations in audio understanding.
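The pseudo-labeling step can be sketched as follows: sample several answers for one question, take the majority answer as the pseudo-label, and scale each sample's reward by how strongly the samples agree. Using the agreement ratio as the confidence weight is an assumption made here; the abstract does not give the exact weighting formula, and the function names are hypothetical.

```python
from collections import Counter

def pseudo_label(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def weighted_rewards(answers):
    """Reward 1 for matching the pseudo-label, scaled by vote confidence (assumed)."""
    label, confidence = pseudo_label(answers)
    return [confidence * (1.0 if a == label else 0.0) for a in answers]
```

For example, the sampled answers `["B", "A", "B", "C", "B"]` yield the pseudo-label `"B"` with confidence 0.6, so low-agreement questions contribute smaller training signals than high-agreement ones.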
[411] Quantization-Based Score Calibration for Few-Shot Keyword Spotting with Dynamic Time Warping in Noisy Environments
Kevin Wilkinghoff, Alessia Cornaggia-Urrigshardt, Zheng-Hua Tan
Main category: eess.AS
TL;DR: Proposes score calibration for template-based open-set few-shot keyword spotting to improve threshold selection robustness in noisy environments.
Details
Motivation: Traditional threshold selection for keyword spotting systems often leads to suboptimal performance on unseen data, especially in varying/noisy acoustic environments or few-shot settings, due to greedy optimization on validation data.
Method: Proposes embedding-level score calibration by quantizing learned representations and applying quantization error-based normalization before DTW-based scoring and thresholding for template-based open-set few-shot KWS.
Result: Experiments on KWS-DailyTalk with simulated high frequency radio channels show the calibration approach simplifies selection of robust detection thresholds and significantly improves performance.
Conclusion: The proposed score calibration method effectively mitigates performance degradation from suboptimal thresholds in noisy few-shot keyword spotting scenarios.
Abstract: Detecting occurrences of keywords with keyword spotting (KWS) systems requires thresholding continuous detection scores. Selecting appropriate thresholds is a non-trivial task, typically relying on optimizing performance on a validation dataset. However, such greedy threshold selection often leads to suboptimal performance on unseen data, particularly in varying or noisy acoustic environments or few-shot settings. In this work, we investigate detection threshold estimation for template-based open-set few-shot KWS using dynamic time warping on noisy speech data. To mitigate the performance degradation caused by suboptimal thresholds, we propose a score calibration approach that operates at the embedding level by quantizing learned representations and applying quantization error-based normalization prior to DTW-based scoring and thresholding. Experiments on KWS-DailyTalk with simulated high frequency radio channels show that the proposed calibration approach simplifies the selection of robust detection thresholds and significantly improves the resulting performance.
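The DTW-based scoring that the calibration feeds into can be sketched as below: a keyword template and a candidate segment, both sequences of embedding vectors, are aligned with dynamic time warping and the path-normalized cost is compared against a threshold. This covers only generic DTW scoring; the quantization-error normalization proposed in the paper is not reproduced, and the function names and normalization by `n + m` are assumptions.

```python
def dtw_cost(template, segment):
    """Length-normalized DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(template), len(segment)
    # acc[i][j] = minimal accumulated cost aligning the first i and j frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum((a - b) ** 2 for a, b in zip(template[i - 1], segment[j - 1])) ** 0.5
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)

def is_keyword(template, segment, threshold):
    """Detect a keyword when the DTW cost falls below the chosen threshold."""
    return dtw_cost(template, segment) < threshold
```

Because the raw cost scale shifts with noise conditions, a threshold tuned on one validation set may not transfer, which is the problem the proposed calibration targets.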
[412] Principled Coarse-Grained Acceptance for Speculative Decoding in Speech
Moran Yanuka, Paul Dixon, Eyal Finkelshtein, Daniel Rotman, Raja Giryes
Main category: eess.AS
TL;DR: PCG accelerates speech generation by allowing draft tokens to be accepted if they belong to the same acoustic similarity group as the target model’s token, rather than requiring exact token matching.
Details
Motivation: Standard speculative decoding for speech LLMs is too restrictive because many discrete acoustic tokens are acoustically or semantically interchangeable, leading to low acceptance rates and limited speedups.
Method: Principled Coarse-Graining (PCG) verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model’s embedding space. It splits each token’s probability mass across overlapping groups and performs rejection sampling on the group variable.
Result: On LibriTTS, PCG increases acceptance rates and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity.
Conclusion: Acoustically aware, group-level acceptance is a simple and general way to accelerate speech token generation while maintaining speech quality.
Abstract: Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model’s embedding space. By splitting each token’s probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while maintaining speech quality.
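A small sketch of the group-level acceptance test described above: each token's probability is divided among the groups that contain it, giving a distribution over groups for both draft and target, and a proposed group is accepted with probability min(1, P_target(g) / P_draft(g)). Splitting the mass equally among a token's groups, the function names, and the toy groups are assumptions; the paper may define the split differently.

```python
def group_distribution(token_probs, groups):
    """Coarse-grain a token distribution onto possibly overlapping groups.

    Assumes every token belongs to at least one group and that its mass is
    split equally across the groups containing it.
    """
    membership = {}
    for g, members in groups.items():
        for t in members:
            membership.setdefault(t, []).append(g)
    dist = {g: 0.0 for g in groups}
    for t, p in token_probs.items():
        for g in membership[t]:
            dist[g] += p / len(membership[t])
    return dist

def acceptance_prob(target_probs, draft_probs, groups, group):
    """Speculative-decoding acceptance probability at the group level."""
    p = group_distribution(target_probs, groups)
    q = group_distribution(draft_probs, groups)
    if q[group] == 0.0:
        return 1.0  # the draft never proposes this group
    return min(1.0, p[group] / q[group])
```

With groups `{"g1": {"a", "b"}, "g2": {"b", "c"}}`, a draft that over-weights token `c` relative to the target is accepted for group `g2` more often than exact token matching on `c` would allow, which is where the higher acceptance rate comes from.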
eess.IV
[413] Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing
Xiang Li, XueHeng Li, Yu Wang, XuanHua He, ZhangChi Hu, WeiWei Yu, ChengJun Xie
Main category: eess.IV
TL;DR: Q-Probe is an agentic IQA framework that addresses limitations of existing RL-based models in high-resolution scenarios by using context-aware probing to capture subtle local degradations without spurious biases.
Details
Motivation: Existing RL-based IQA models rely on coarse-grained global views and fail to capture subtle local degradations in high-resolution scenarios. Current "Thinking with Images" paradigms adapted to IQA introduce spurious "cropping-implies-degradation" biases and misinterpret natural depth-of-field as artifacts.
Method: Proposes Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. Includes: 1) Vista-Bench benchmark for fine-grained local degradation analysis in high-resolution IQA, and 2) A three-stage training paradigm that progressively aligns models with human preferences while eliminating causal bias through a novel context-aware cropping strategy.
Result: Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.
Conclusion: Q-Probe successfully addresses the challenges of high-resolution IQA by introducing context-aware probing and eliminating spurious biases, establishing a new framework for fine-grained image quality assessment that works effectively across different resolution scales.
Abstract: Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging “Thinking with Images” paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious “cropping-implies-degradation” biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.
[414] High-Fidelity 3D Tooth Reconstruction by Fusing Intraoral Scans and CBCT Data via a Deep Implicit Representation
Yi Zhu, Razmig Kechichian, Raphaël Richert, Satoshi Ikehata, Sébastien Valette
Main category: eess.IV
TL;DR: Fully-automated pipeline fuses CBCT (root) and IOS (crown) data using deep implicit representation to create seamless 3D tooth models.
Details
Motivation: Clinical imaging limitations: CBCT captures root but has noisy crown, IOS provides high-fidelity crown but no root. Naive fusion creates unnatural seams and artifacts.
Method: Segments and registers tooth instances, creates hybrid proxy mesh (IOS crown + CBCT root), uses DeepSDF network guided by proxy to project onto learned manifold of ideal tooth shapes.
Result: Generates seamless, watertight, anatomically coherent models that preserve high-fidelity crown from IOS and patient-specific root morphology from CBCT.
Conclusion: Method overcomes limitations of individual modalities and naive stitching, providing complete high-fidelity 3D tooth models for digital dentistry.
Abstract: High-fidelity 3D tooth models are essential for digital dentistry, but must capture both the detailed crown and the complete root. Clinical imaging modalities are limited: Cone-Beam Computed Tomography (CBCT) captures the root but has a noisy, low-resolution crown, while Intraoral Scanners (IOS) provide a high-fidelity crown but no root information. A naive fusion of these sources results in unnatural seams and artifacts. We propose a novel, fully-automated pipeline that fuses CBCT and IOS data using a deep implicit representation. Our method first segments and robustly registers the tooth instances, then creates a hybrid proxy mesh combining the IOS crown and the CBCT root. The core of our approach is to use this noisy proxy to guide a class-specific DeepSDF network. This optimization process projects the input onto a learned manifold of ideal tooth shapes, generating a seamless, watertight, and anatomically coherent model. Qualitative and quantitative evaluations show our method uniquely preserves both the high-fidelity crown from IOS and the patient-specific root morphology from CBCT, overcoming the limitations of each modality and naive stitching.
[415] An Efficient Quality Metric for Video Frame Interpolation Based on Motion-Field Divergence
Conall Daly, Darren Ramsook, Anil Kokaram
Main category: eess.IV
TL;DR: PSNR_DIV is a novel full-reference quality metric for video frame interpolation that enhances PSNR with motion divergence weighting, achieving better correlation with human perception than FloLPIPS while being 2.5× faster and using 4× less memory.
Details
Motivation: Existing quality metrics (PSNR, SSIM, LPIPS) ignore temporal coherence, while specialized metrics like FloLPIPS are computationally inefficient, limiting practical application for evaluating video frame interpolation quality.
Method: PSNR_DIV enhances PSNR through motion divergence weighting, adapted from archival film restoration. It detects temporal inconsistencies by highlighting singularities in motion fields and uses this to weight image errors.
Result: On BVI-VFI dataset (180 sequences), PSNR_DIV achieves +0.09 Pearson Linear Correlation Coefficient improvement over FloLPIPS, while being 2.5× faster and using 4× less memory. Performance is consistent across content categories and robust to motion estimator choice.
Conclusion: PSNR_DIV provides efficient and accurate quality evaluation for video frame interpolation, enabling practical use as a loss function for training neural networks in VFI tasks.
Abstract: Video frame interpolation is a fundamental tool for temporal video enhancement, but existing quality metrics struggle to evaluate the perceptual impact of interpolation artefacts effectively. Metrics like PSNR, SSIM and LPIPS ignore temporal coherence. State-of-the-art quality metrics tailored towards video frame interpolation, like FloLPIPS, have been developed but suffer from computational inefficiency that limits their practical application. We present $\text{PSNR}_{\text{DIV}}$, a novel full-reference quality metric that enhances PSNR through motion divergence weighting, a technique adapted from archival film restoration where it was developed to detect temporal inconsistencies. Our approach highlights singularities in motion fields, which are then used to weight image errors. Evaluation on the BVI-VFI dataset (180 sequences across multiple frame rates, resolutions and interpolation methods) shows $\text{PSNR}_{\text{DIV}}$ achieves statistically significant improvements: +0.09 Pearson Linear Correlation Coefficient over FloLPIPS, while being 2.5$\times$ faster and using 4$\times$ less memory. Performance remains consistent across all content categories and is robust to the motion estimator used. The efficiency and accuracy of $\text{PSNR}_{\text{DIV}}$ enable fast quality evaluation and practical use as a loss function for training neural networks for video frame interpolation tasks. An implementation of our metric is available at www.github.com/conalld/psnr-div.
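The idea of weighting image errors by motion-field divergence can be sketched as follows. This is not the authors' formula: the weight `1 + |div|`, the function names, and the flow layout are assumptions chosen only to show the mechanism, and the motion field is taken as given from any optical-flow estimator.

```python
import numpy as np

def flow_divergence(flow):
    """Divergence of a motion field; flow is (H, W, 2) with [..., 0]=u (x), [..., 1]=v (y)."""
    du_dx = np.gradient(flow[..., 0], axis=1)
    dv_dy = np.gradient(flow[..., 1], axis=0)
    return du_dx + dv_dy

def divergence_weighted_psnr(ref, test, flow, peak=255.0):
    """PSNR computed from a squared error weighted by |divergence| (assumed weighting)."""
    weights = 1.0 + np.abs(flow_divergence(flow))
    sq_err = (ref.astype(float) - test.astype(float)) ** 2
    wmse = np.sum(weights * sq_err) / np.sum(weights)
    if wmse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / wmse))
```

With a zero or uniformly diverging motion field this reduces to ordinary PSNR, while errors located where the motion field has large divergence (typical of occlusions and interpolation failures) lower the score more than the same errors elsewhere.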
[416] Aligned Stable Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency
Yikai Wang, Junqiu Yu, Chenjie Cao, Xiangyang Xue, Yanwei Fu
Main category: eess.IV
TL;DR: ASUKA is a post-hoc framework that addresses unwanted object hallucination and color inconsistency in generative image inpainting by using reconstruction-based priors and a specialized VAE decoder for local harmonization.
Details
Motivation: Existing generative inpainting methods produce unnatural results due to two main issues: (1) unwanted object insertion where models hallucinate arbitrary objects that don't match context, and (2) color inconsistency leading to noticeable color shifts and smeared textures.
Method: ASUKA framework uses reconstruction-based priors to suppress object hallucination while preserving generative flexibility, and a specialized VAE decoder that formulates latent-to-image decoding as a local harmonization task to reduce color shifts. Implemented on U-Net and DiT architectures with lightweight injection strategies.
Result: ASUKA effectively suppresses object hallucination and improves color consistency, outperforming standard diffusion models, rectified flow models, and other inpainting methods on Places2 dataset and the proposed MISATO benchmark.
Conclusion: ASUKA provides efficient post-hoc solutions for pre-trained inpainting models that address key issues of object hallucination and color inconsistency while maintaining generative capacity, with code and models to be released.
Abstract: Generative image inpainting can produce realistic, high-fidelity results even with large, irregular masks. However, existing methods still face key issues that make inpainted images look unnatural. In this paper, we identify two main problems: (1) Unwanted object insertion: generative models may hallucinate arbitrary objects in the masked region that do not match the surrounding context. (2) Color inconsistency: inpainted regions often exhibit noticeable color shifts, leading to smeared textures and degraded image quality. We analyze the underlying causes of these issues and propose efficient post-hoc solutions for pre-trained inpainting models. Specifically, we introduce the principled framework of Aligned Stable inpainting with UnKnown Areas prior (ASUKA). To reduce unwanted object insertion, we use reconstruction-based priors to guide the generative model, suppressing hallucinated objects while preserving generative flexibility. To address color inconsistency, we design a specialized VAE decoder that formulates latent-to-image decoding as a local harmonization task. This design significantly reduces color shifts and produces more color-consistent results. We implement ASUKA on two representative inpainting architectures: a U-Net-based model and a DiT-based model. We analyze and propose lightweight injection strategies that minimize interference with the model’s original generation capacity while ensuring the mitigation of the two issues. We evaluate ASUKA using the Places2 dataset and MISATO, our proposed diverse benchmark. Experiments show that ASUKA effectively suppresses object hallucination and improves color consistency, outperforming standard diffusion models, rectified flow models, and other inpainting methods. The dataset, models, and code will be released on GitHub.
[417] OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation
Letian Zhang, Sucheng Ren, Yanqing Liu, Xianhang Li, Zeyu Wang, Yuyin Zhou, Huaxiu Yao, Zeyu Zheng, Weili Nie, Guilin Liu, Zhiding Yu, Cihang Xie
Main category: eess.IV
TL;DR: OpenVision 3 is a unified vision encoder that learns a single representation for both image understanding and generation by combining reconstruction and semantic objectives in a shared latent space.
Details
Motivation: Current vision systems typically use separate models for image understanding (like CLIP) and image generation (like VAE encoders), creating a disconnect between these two fundamental visual capabilities. The authors aim to develop a single, unified visual representation that can serve both purposes effectively.
Method: The architecture feeds VAE-compressed image latents to a ViT encoder. The encoder output serves dual purposes: 1) passed to a ViT-VAE decoder for image reconstruction (capturing generative structure), and 2) optimized with contrastive learning and image-captioning objectives (strengthening semantic features). The model is jointly trained with both reconstruction- and semantics-driven signals in a shared latent space.
Result: The unified encoder performs comparably to standard CLIP for multimodal understanding (62.4 vs 62.2 on SeedBench, 83.7 vs 82.9 on POPE) and substantially surpasses CLIP-based encoders for generation (gFID: 1.89 vs 2.54 on ImageNet). The frozen encoder shows strong generalization across both understanding and generation tasks.
Conclusion: OpenVision 3 demonstrates that a single, unified visual representation can effectively serve both image understanding and generation tasks, bridging the gap between these traditionally separate domains. The work encourages future research on unified visual modeling approaches.
Abstract: This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably with a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
[418] A Machine Vision Approach to Preliminary Skin Lesion Assessments
Ali Khreis, Ro’Yah Radaideh, Quinn McGill
Main category: eess.IV
TL;DR: A study comparing rule-based ABCD dermoscopy scoring with machine learning for skin lesion classification, finding that custom CNNs outperform both traditional methods and transfer learning approaches on medical image datasets.
Details
Motivation: Early detection of malignant skin lesions is critical for improving patient outcomes in aggressive, metastatic skin cancers. The study aims to evaluate automated systems for preliminary skin lesion assessment.
Method: The study combines the clinically established ABCD rule of dermoscopy with machine learning classification. It uses a 1,000-image subset of HAM10000 dataset to compute Total Dermoscopy Score (TDS) via automated rule-based pipeline. This is compared against traditional classifiers (Logistic Regression, Random Forest, SVM) and deep learning models including transfer learning with EfficientNet-B0 and a custom three-layer CNN trained from scratch.
Result: Rule-based system showed performance bottleneck when reducing complex morphology to five numerical features. Transfer learning with EfficientNet-B0 failed significantly due to domain shift between natural and medical images. Custom three-layer CNN achieved 78.5% accuracy and 86.5% recall on median-filtered images, representing a 19-point accuracy improvement over traditional methods.
Conclusion: Direct pixel-level learning captures diagnostic patterns beyond handcrafted features, and purpose-built lightweight architectures can outperform large pretrained models for small, domain-specific medical datasets. Custom CNNs trained from scratch are more effective for medical image analysis than transfer learning approaches that suffer from domain shift issues.
Abstract: Early detection of malignant skin lesions is critical for improving patient outcomes in aggressive, metastatic skin cancers. This study evaluates a comprehensive system for preliminary skin lesion assessment that combines the clinically established ABCD rule of dermoscopy (analyzing Asymmetry, Borders, Color, and Dermoscopic Structures) with machine learning classification. Using a 1,000-image subset of the HAM10000 dataset, the system implements an automated, rule-based pipeline to compute a Total Dermoscopy Score (TDS) for each lesion. This handcrafted approach is compared against various machine learning solutions, including traditional classifiers (Logistic Regression, Random Forest, and SVM) and deep learning models. While the rule-based system provides high clinical interpretability, results indicate a performance bottleneck when reducing complex morphology to five numerical features. Experimental findings show that transfer learning with EfficientNet-B0 failed significantly due to domain shift between natural and medical images. In contrast, a custom three-layer Convolutional Neural Network (CNN) trained from scratch achieved 78.5% accuracy and 86.5% recall on median-filtered images, representing a 19-point accuracy improvement over traditional methods. The results demonstrate that direct pixel-level learning captures diagnostic patterns beyond handcrafted features and that purpose-built lightweight architectures can outperform large pretrained models for small, domain-specific medical datasets.
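As a rough illustration of the rule-based step described above, the sketch below computes a Total Dermoscopy Score from four ABCD sub-scores. It assumes the classic ABCD-rule weights (1.3, 0.1, 0.5, 0.5) and the commonly cited 4.75 / 5.45 cut-offs; the summary does not state which weights or thresholds the paper's pipeline uses, and the names ABCDScores, total_dermoscopy_score, and interpret_tds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ABCDScores:
    """Sub-scores of the ABCD rule (ranges follow the classic rule, assumed here)."""
    asymmetry: float   # 0-2: number of axes with asymmetry
    border: float      # 0-8: segments with an abrupt pigment cut-off
    color: float       # 1-6: number of distinct colors present
    structures: float  # 1-5: number of dermoscopic structures present


def total_dermoscopy_score(s: ABCDScores) -> float:
    # Weighted sum with the classic ABCD-rule weights (an assumption;
    # the paper's pipeline may differ).
    return 1.3 * s.asymmetry + 0.1 * s.border + 0.5 * s.color + 0.5 * s.structures


def interpret_tds(tds: float) -> str:
    # Commonly cited cut-offs for the ABCD rule (also an assumption here).
    if tds < 4.75:
        return "benign"
    if tds <= 5.45:
        return "suspicious"
    return "malignant"


if __name__ == "__main__":
    lesion = ABCDScores(asymmetry=1, border=5, color=3, structures=4)
    tds = total_dermoscopy_score(lesion)
    print(round(tds, 2), interpret_tds(tds))
```

Whatever the exact weights, the score collapses each lesion to a handful of numbers, which is the bottleneck the abstract attributes to the handcrafted approach.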
[419] FUGC: Benchmarking Semi-Supervised Learning Methods for Cervical Segmentation
Jieyun Bai, Yitong Tang, Zihao Zhou, Mahdi Islam, Musarrat Tabassum, Enrique Almar-Munoz, Hongyu Liu, Hui Meng, Nianjiang Lv, Bo Deng, Yu Chen, Zilun Peng, Yusong Xiao, Li Xiao, Nam-Khanh Tran, Dac-Phu Phan-Le, Hai-Dang Nguyen, Xiao Liu, Jiale Hu, Mingxu Huang, Jitao Liang, Chaolu Feng, Xuezhi Zhang, Lyuyang Tong, Bo Du, Ha-Hieu Pham, Thanh-Huy Nguyen, Min Xu, Juntao Jiang, Jiangning Zhang, Yong Liu, Md. Kamrul Hasan, Jie Gan, Zhuonan Liang, Weidong Cai, Yuxin Huang, Gongning Luo, Mohammad Yaqub, Karim Lekadir
Main category: eess.IV
TL;DR: FUGC is the first benchmark for semi-supervised cervical segmentation in TVS images, addressing preterm birth risk assessment with limited labeled data.
Details
Motivation: Accurate cervical segmentation in transvaginal ultrasound is crucial for preterm birth risk assessment, but supervised learning approaches are limited by scarce labeled data.
Method: Created the FUGC benchmark with 890 TVS images (500 training, 90 validation, 300 test) and evaluated methods using a weighted combination of Dice Similarity Coefficient (40%), Hausdorff Distance (40%), and runtime (20%).
Result: 10 teams with 82 participants submitted solutions; the best-performing methods for each individual metric achieved 90.26% mDSC, 38.88 mHD, and 32.85 ms RT, respectively.
Conclusion: FUGC establishes a standardized benchmark for cervical segmentation, demonstrates semi-supervised methods’ efficacy with limited labeled data, and provides a foundation for AI-assisted clinical preterm birth risk assessment.
Abstract: Accurate segmentation of cervical structures in transvaginal ultrasound (TVS) is critical for assessing the risk of spontaneous preterm birth (PTB), yet the scarcity of labeled data limits the performance of supervised learning approaches. This paper introduces the Fetal Ultrasound Grand Challenge (FUGC), the first benchmark for semi-supervised learning in cervical segmentation, hosted at ISBI 2025. FUGC provides a dataset of 890 TVS images, including 500 training images, 90 validation images, and 300 test images. Methods were evaluated using the Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), and runtime (RT), with a weighted combination of 0.4/0.4/0.2. The challenge attracted 10 teams with 82 participants submitting innovative solutions. The best-performing methods for each individual metric achieved 90.26% mDSC, 38.88 mHD, and 32.85 ms RT, respectively. FUGC establishes a standardized benchmark for cervical segmentation, demonstrates the efficacy of semi-supervised methods with limited labeled data, and provides a foundation for AI-assisted clinical PTB risk assessment.
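The evaluation described above can be sketched as follows. This is a minimal illustration, not the challenge's scoring code: the Hausdorff distance is computed by brute force over all foreground pixels rather than over boundaries, and weighted_score assumes each metric has already been mapped to a higher-is-better value in [0, 1], a normalization the abstract does not specify. All function names are hypothetical.

```python
import numpy as np


def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def hausdorff_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """Symmetric Hausdorff distance (in pixels) between foreground pixel sets."""
    a = np.argwhere(pred)
    b = np.argwhere(target)
    if len(a) == 0 or len(b) == 0:
        return float("inf")
    # Pairwise Euclidean distances; fine for small masks, O(|a| * |b|) memory.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


def weighted_score(dsc_score: float, hd_score: float, rt_score: float,
                   weights=(0.4, 0.4, 0.2)) -> float:
    """Combine three normalized, higher-is-better scores with 0.4/0.4/0.2 weights."""
    w_dsc, w_hd, w_rt = weights
    return w_dsc * dsc_score + w_hd * hd_score + w_rt * rt_score


if __name__ == "__main__":
    pred = np.zeros((4, 4), dtype=int)
    target = np.zeros((4, 4), dtype=int)
    pred[1:3, 1:3] = 1
    target[1:3, 1:4] = 1
    print(dice_coefficient(pred, target), hausdorff_distance(pred, target))
```

Because Hausdorff distance and runtime are lower-is-better, any real ranking has to invert or rank-transform them before applying the weights.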
[420] THOR: A Versatile Foundation Model for Earth Observation Climate and Society Applications
Theodor Forgaard, Jarle H. Reksten, Anders U. Waldeland, Valerio Marsocci, Nicolas Longépé, Michael Kampffmeyer, Arnt-Børre Salberg
Main category: eess.IV
TL;DR: THOR is a compute-adaptive Earth observation foundation model that unifies multiple satellite sensors (Sentinel-1, -2, -3) with native resolutions from 10m to 1000m, enabling flexible patch sizes and dynamic compute-accuracy trade-offs without retraining.
Details
Motivation: Current Earth observation foundation models are architecturally rigid, struggle with heterogeneous sensors, and are constrained to fixed patch sizes, limiting their deployment in real-world scenarios requiring flexible compute-accuracy trade-offs.
Method: THOR unifies data from Copernicus Sentinel-1, -2, and -3 satellites, processing their native resolutions (10m to 1000m) in a single model. It uses a novel randomized patch and input image size strategy during pre-training, allowing a single set of weights to be deployed with any patch size at inference.
Result: THOR achieves state-of-the-art performance on downstream benchmarks, particularly excelling in data-limited regimes like the PANGAEA 10% split, validating its flexible feature generation for diverse climate and society applications.
Conclusion: THOR solves both input heterogeneity and deployment rigidity in Earth observation, enabling dynamic trade-offs between computational cost and feature resolution without retraining, making it more practical for real-world deployment scenarios.
Abstract: Current Earth observation foundation models are architecturally rigid, struggle with heterogeneous sensors and are constrained to fixed patch sizes. This limits their deployment in real-world scenarios requiring flexible compute-accuracy trade-offs. We propose THOR, a “compute-adaptive” foundation model that solves both input heterogeneity and deployment rigidity. THOR is the first architecture to unify data from Copernicus Sentinel-1, -2, and -3 (OLCI & SLSTR) satellites, processing their native 10 m to 1000 m resolutions in a single model. We pre-train THOR with a novel randomized patch and input image size strategy. This allows a single set of pre-trained weights to be deployed at inference with any patch size, enabling a dynamic trade-off between computational cost and feature resolution without retraining. We pre-train THOR on THOR Pretrain, a new, large-scale multi-sensor dataset and demonstrate state-of-the-art performance on downstream benchmarks, particularly in data-limited regimes like the PANGAEA 10% split, validating that THOR’s flexible feature generation excels for diverse climate and society applications.
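To make the cost versus feature-resolution trade-off concrete, the sketch below does the generic ViT arithmetic: a smaller patch size yields more tokens and finer features, at a self-attention cost that grows roughly with the square of the token count. It is not THOR's code; the candidate patch sizes, the 224-pixel input, and all function names are hypothetical, and sample_patch_size only gestures at what a randomized patch-size strategy might look like.

```python
import random


def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def relative_attention_cost(image_size: int, patch_size: int,
                            reference_patch_size: int) -> float:
    """Self-attention cost relative to a reference patch size (quadratic in tokens)."""
    ratio = num_tokens(image_size, patch_size) / num_tokens(image_size, reference_patch_size)
    return ratio ** 2


def feature_resolution_m(pixel_resolution_m: float, patch_size: int) -> float:
    """Ground footprint of one token, in metres."""
    return pixel_resolution_m * patch_size


def sample_patch_size(candidates, rng: random.Random) -> int:
    """Draw a patch size for one training step (hypothetical sampling scheme)."""
    return rng.choice(list(candidates))


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (8, 16, 32):  # hypothetical candidate patch sizes
        print(p, num_tokens(224, p), relative_attention_cost(224, p, 16),
              feature_resolution_m(10.0, p))
    print(sample_patch_size((8, 16, 32), rng))
```

Under these assumptions, halving the patch size quadruples the token count, so the attention cost rises about sixteen-fold in exchange for features at twice the spatial resolution.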
[421] Phi-SegNet: Phase-Integrated Supervision for Medical Image Segmentation
Shams Nafisa Ali, Taufiq Hasan
Main category: eess.IV
TL;DR: Phi-SegNet introduces phase-aware frequency-domain supervision for medical image segmentation, achieving SOTA performance across multiple modalities with improved generalization.
Details
Motivation: Existing segmentation architectures focus on spatial information while overlooking frequency-domain representations that capture rich structural/textural cues. Current methods lack supervision-level integration of frequency information crucial for fine-grained object localization.
Method: CNN-based architecture with Bi-Feature Mask Former (BFMF) modules to blend encoder features and reduce semantic gaps, Reverse Fourier Attention (RFA) blocks to refine decoder outputs using phase-regularized features, and a dedicated phase-aware loss for structural alignment.
Result: Achieved state-of-the-art performance on five public datasets (X-ray, US, histopathology, MRI, colonoscopy) with an average relative improvement of 1.54±1.26% in IoU and 0.98±0.71% in F1-score. Demonstrated robust cross-dataset generalization on unseen datasets.
Conclusion: Leveraging spectral priors at both feature representation and supervision levels enables generalized segmentation frameworks with superior fine-grained object localization, demonstrating modality-agnostic adaptability.
Abstract: Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54±1.26% in IoU and 0.98±0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from the known domain, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
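The abstract does not define the phase-aware loss, so the sketch below is only a generic illustration of what supervising on Fourier phase can look like: it penalizes the phase difference between the 2D spectra of a predicted and a reference mask. The function names and the 1 - cos form are assumptions, not Phi-SegNet's formulation.

```python
import numpy as np


def phase_spectrum(mask: np.ndarray) -> np.ndarray:
    """Phase (in radians) of the 2D discrete Fourier transform of a mask."""
    return np.angle(np.fft.fft2(mask.astype(float)))


def phase_alignment_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean of 1 - cos(phase difference) over all frequency bins.

    Zero when the two phase spectra agree; insensitive to 2*pi wrap-around.
    Bins with near-zero magnitude have ill-defined phase, so a practical
    version would likely weight or mask them (not done in this sketch).
    """
    diff = phase_spectrum(pred) - phase_spectrum(target)
    return float(np.mean(1.0 - np.cos(diff)))


if __name__ == "__main__":
    mask = np.zeros((8, 8))
    mask[2:5, 2:5] = 1.0
    shifted = np.roll(mask, 1, axis=1)  # same shape, moved one pixel sideways
    print(phase_alignment_loss(mask, mask))
    print(phase_alignment_loss(mask, shifted))
```

A spatial shift changes the phase of the spectrum but not its magnitude, so a phase term reacts to where the mask sits and not only to how much area it covers, which fits the emphasis on boundary precision.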
[422] From Text to Image: Exploring GPT-4Vision’s Potential in Advanced Radiological Analysis across Subspecialties
Felix Busch, Tianyu Han, Marcus Makowski, Daniel Truhn, Keno Bressem, Lisa Adams
Main category: eess.IV
TL;DR: GPT-4Vision shows potential for recognizing radiological features directly from medical images, outperforming text-based GPT-4 in diagnostic capabilities.
Details
Motivation: To evaluate whether vision-enabled language models like GPT-4Vision can directly interpret medical images for radiological tasks, potentially surpassing text-only models that rely on textual descriptions.
Method: Comparative evaluation of GPT-4 (text-only) and GPT-4Vision (multimodal) on radiological tasks, assessing their ability to recognize features from medical images versus textual descriptions.
Result: GPT-4Vision demonstrates capability to recognize radiological features directly from images, suggesting enhanced diagnostic potential compared to GPT-4’s text-based approach.
Conclusion: Vision-enabled language models like GPT-4Vision show promise for radiological applications by directly interpreting medical images, potentially improving diagnostic accuracy over text-only models.
Abstract: The study evaluates and compares GPT-4 and GPT-4Vision for radiological tasks, suggesting GPT-4Vision may recognize radiological features from images, thereby enhancing its diagnostic potential over text-based descriptions.
[423] VHU-Net: Variational Hadamard U-Net for Body MRI Bias Field Correction
Xin Zhu, Ahmet Enis Cetin, Gorkem Durak, Batuhan Gundogdu, Ziliang Hong, Hongyi Pan, Ertugrul Aktas, Elif Keles, Hatice Savas, Aytekin Oto, Hiten Patel, Adam B. Murphy, Ashley Ross, Frank Miller, Baris Turkbey, Ulas Bagci
Main category: eess.IV
TL;DR: VHU-Net: A variational Hadamard U-Net architecture for MRI bias field correction using Hadamard transforms and variational inference to improve image uniformity and downstream segmentation.
Details
Motivation: Bias field artifacts in MRI scans cause intensity inhomogeneities that degrade image quality and hinder downstream analysis tasks like segmentation. Existing methods may lack computational efficiency, interpretability, or robustness across multi-center datasets.
Method: Proposes VHU-Net, whose encoder uses convolutional Hadamard transform blocks (ConvHTBlocks) that perform channel-wise frequency decomposition via a Hadamard transform, scaling layers, and semi-soft thresholding. The decoder incorporates an inverse HT-reconstructed transformer block for global frequency-aware attention. Training uses variational inference with an evidence lower bound (ELBO) objective to promote sparsity in the latent space.
Result: Superior performance over state-of-the-art methods on body MRI datasets in terms of intensity uniformity. Corrected images yield substantial improvements in downstream segmentation accuracy. Framework offers computational efficiency, interpretability, and robust performance across multi-center datasets.
Conclusion: VHU-Net effectively addresses MRI bias field correction with a novel architecture combining Hadamard transforms, attention mechanisms, and variational inference. The method is suitable for clinical deployment due to its computational efficiency, interpretability, and robust performance.
Abstract: Bias field artifacts in magnetic resonance imaging (MRI) scans introduce spatially smooth intensity inhomogeneities that degrade image quality and hinder downstream analysis. To address this challenge, we propose a novel variational Hadamard U-Net (VHU-Net) for effective body MRI bias field correction. The encoder comprises multiple convolutional Hadamard transform blocks (ConvHTBlocks), each integrating convolutional layers with a Hadamard transform (HT) layer. Specifically, the HT layer performs channel-wise frequency decomposition to isolate low-frequency components, while a subsequent scaling layer and semi-soft thresholding mechanism suppress redundant high-frequency noise. To compensate for the HT layer’s inability to model inter-channel dependencies, the decoder incorporates an inverse HT-reconstructed transformer block, enabling global, frequency-aware attention for the recovery of spatially consistent bias fields. The stacked decoder ConvHTBlocks further enhance the capacity to reconstruct the underlying ground-truth bias field. Building on the principles of variational inference, we formulate a new evidence lower bound (ELBO) as the training objective, promoting sparsity in the latent space while ensuring accurate bias field estimation. Comprehensive experiments on body MRI datasets demonstrate the superiority of VHU-Net over existing state-of-the-art methods in terms of intensity uniformity. Moreover, the corrected images yield substantial downstream improvements in segmentation accuracy. Our framework offers computational efficiency, interpretability, and robust performance across multi-center datasets, making it suitable for clinical deployment.
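The transform-then-threshold idea behind the HT layer can be sketched in a few lines. The code below applies an orthonormal fast Walsh-Hadamard transform along the last axis, shrinks the coefficients, and transforms back. It is a simplified stand-in, not the paper's ConvHTBlock: plain soft-thresholding replaces the semi-soft variant (whose definition the abstract does not give), there are no learned scaling weights, and the function names are hypothetical.

```python
import numpy as np


def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two. With 1/sqrt(n) scaling the
    transform is its own inverse.
    """
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[-1]
    if n == 0 or n & (n - 1):
        raise ValueError("last dimension must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly: sums
            x[..., i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x / np.sqrt(n)


def soft_threshold(x: np.ndarray, t: float) -> np.ndarray:
    """Shrink coefficients toward zero by t; magnitudes below t become zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def hadamard_shrink(features: np.ndarray, threshold: float) -> np.ndarray:
    """Transform, suppress small coefficients, and transform back."""
    return fwht(soft_threshold(fwht(features), threshold))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    smooth = np.ones(8)                       # stands in for a smooth component
    noisy = smooth + 0.1 * rng.standard_normal(8)
    print(hadamard_shrink(noisy, threshold=0.3))
```

A constant signal concentrates in a single Hadamard coefficient, so thresholding mostly removes the small coefficients carrying the perturbation, which mirrors the abstract's description of isolating low-frequency components while suppressing high-frequency noise.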