Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 142]
- cs.CV [Total: 138]
- cs.AI [Total: 44]
- cs.SD [Total: 11]
- cs.LG [Total: 156]
- cs.MA [Total: 4]
- cs.MM [Total: 5]
- eess.AS [Total: 18]
- eess.IV [Total: 9]
cs.CL
[1] Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias
Sirui Wu, Daijin Yang
Main category: cs.CL
TL;DR: This study evaluates using LLM-assisted item neutralization to reduce social desirability bias in personality assessments, showing mixed but promising results.
Details
Motivation: To address social desirability bias in personality assessments by leveraging large language models to neutralize test items, making them less susceptible to biased responding.
Method: Used GPT-o3 to rewrite the IPIP-BFM-50 personality measure, then had 203 participants complete either original or neutralized versions along with the Marlowe-Crowne Social Desirability Scale to compare bias reduction (see the code sketch after the abstract).
Result: Neutralized items preserved reliability and factor structure but showed inconsistent bias reduction - gains in Conscientiousness, declines in Agreeableness and Openness. Social desirability correlations decreased for some items but not consistently. Configural invariance held but metric and scalar invariance failed.
Conclusion: AI-assisted neutralization shows potential as a bias-reduction method but requires refinement as it’s currently imperfect, with inconsistent effects across personality dimensions.
Abstract: This study evaluates item neutralization assisted by the large language model (LLM) to reduce social desirability bias in personality assessment. GPT-o3 was used to rewrite the International Personality Item Pool Big Five Measure (IPIP-BFM-50), and 203 participants completed either the original or neutralized form along with the Marlowe-Crowne Social Desirability Scale. The results showed preserved reliability and a five-factor structure, with gains in Conscientiousness and declines in Agreeableness and Openness. The correlations with social desirability decreased for several items, but inconsistently. Configural invariance held, though metric and scalar invariance failed. Findings support AI neutralization as a potential but imperfect bias-reduction method.
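A minimal sketch of the two psychometric checks the study relies on: Cronbach's alpha for reliability and item-level correlation with the Marlowe-Crowne score. The data below is simulated; the shapes (203 respondents, 50 Likert items, 33 binary MCSDS items) follow the summary, everything else is an assumption.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) response matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item variance
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(203, 50)).astype(float)  # 5-point Likert
mcsds = rng.integers(0, 2, size=(203, 33)).sum(axis=1)        # Marlowe-Crowne sum

print(f"alpha = {cronbach_alpha(responses):.3f}")
# Item-level social-desirability correlation, as in the bias analysis.
r = np.corrcoef(responses[:, 0], mcsds)[0, 1]
print(f"item 1 vs MCSDS: r = {r:.3f}")
```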
[2] FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Yugang Jia, Jong Ha Lee
Main category: cs.CL
TL;DR: FHIR-AgentBench is a new benchmark for evaluating LLM agents on clinical data using the HL7 FHIR standard, addressing the gap in existing benchmarks for interoperable clinical data.
Details
Motivation: Existing benchmarks lack realism for evaluating LLMs on interoperable clinical data following the shift to the HL7 FHIR standard, creating a need for more realistic evaluation tools.
Method: Created a benchmark with 2,931 real-world clinical questions grounded in the HL7 FHIR standard, systematically evaluating agentic frameworks with different data retrieval strategies, interaction patterns, and reasoning strategies (a retrieval sketch follows the abstract).
Result: Experiments revealed practical challenges in retrieving data from complex FHIR resources and reasoning over them, both significantly impacting question answering performance.
Conclusion: FHIR-AgentBench dataset and evaluation suite are publicly released to promote reproducible research and development of robust LLM agents for clinical applications.
Abstract: The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
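To make the "direct FHIR API calls" retrieval strategy concrete, here is a rough sketch of a standard FHIR search over REST. The base URL and patient reference are placeholder assumptions; only the search-parameter syntax (token search with `system|code`, `_sort`, `_count`) follows the HL7 FHIR specification.

```python
import requests

BASE = "https://fhir.example.org/r4"  # hypothetical FHIR server

def fhir_search(resource: str, **params) -> list[dict]:
    """Run a FHIR search and return the resources inside the result Bundle."""
    resp = requests.get(f"{BASE}/{resource}", params=params, timeout=30)
    resp.raise_for_status()
    return [e["resource"] for e in resp.json().get("entry", [])]

# E.g., "what are the patient's recent blood-pressure readings?" might become:
obs = fhir_search(
    "Observation",
    subject="Patient/example",            # placeholder patient reference
    code="http://loinc.org|85354-9",      # LOINC: blood pressure panel
    _sort="-date",
    _count=5,
)
for o in obs:
    print(o.get("effectiveDateTime"), o.get("code", {}).get("text"))
```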
[3] Readme_AI: Dynamic Context Construction for Large Language Models
Millie Vyas, Timothy Blattner, Alden Dima
Main category: cs.CL
TL;DR: A protocol for dynamically building context from data sources to improve LLM accuracy by reducing hallucinations and providing specialized, owner-provided metadata.
Details
Motivation: LLMs often provide inaccurate or unreliable information for specific queries despite extensive training, and query-specific context significantly enhances response usefulness.
Method: A specification that allows data source owners to create metadata files for LLMs, implemented as a prototype Readme_AI MCP server that retrieves metadata and dynamically builds context using extensible types (web-pages, data repositories, publications, text) with user-specified tags.
Result: The prototype successfully enabled LLMs to reason about the NIST Hedgehog library and generate accurate code interpolated from examples, overcoming previous inaccuracies and hallucinations.
Conclusion: The primary contribution is an extensible protocol for dynamically grounding LLMs in specialized data, enhancing response quality and reducing hallucinations.
Abstract: Despite being trained on significant amounts of data, Large Language Models (LLMs) can provide inaccurate or unreliable information in the context of a user’s specific query. Providing query-specific context significantly improves the usefulness of their responses. In this paper, we present a specification that can be used to dynamically build context for data sources. The data source owner creates the file containing metadata for LLMs to use when reasoning about dataset-related queries. To demonstrate our proposed specification, we created a prototype Readme_AI Model Context Protocol (MCP) server that retrieves the metadata from the data source and uses it to dynamically build context. Some features that make this specification dynamic are the extensible types that represent crawling web-pages, fetching data from data repositories, downloading and parsing publications, and general text. The context is formatted and grouped using user-specified tags that provide clear contextual information for the LLM to reason about the content. We demonstrate the capabilities of this early prototype by asking the LLM about the NIST-developed Hedgehog library, for which common LLMs often provide inaccurate and irrelevant responses containing hallucinations. With Readme_AI, the LLM receives enough context that it is now able to reason about the library and its use, and even generate code interpolated from examples that were included in the Readme_AI file provided by Hedgehog’s developer. Our primary contribution is an extensible protocol for dynamically grounding LLMs in specialized, owner-provided data, enhancing responses from LLMs and reducing hallucinations. The source code for the Readme_AI tool is posted here: https://github.com/usnistgov/readme_ai.
[4] Magnitude Matters: a Superior Class of Similarity Metrics for Holistic Semantic Understanding
V. S. Raghu Parupudi
Main category: cs.CL
TL;DR: This paper introduces new magnitude-aware similarity metrics (Overlap Similarity and Hyperbolic Tangent Similarity) that outperform traditional dot product and cosine similarity on paraphrase and inference tasks, while identifying limitations for compositional semantics tasks.
Details
Motivation: Current vector comparison standards (dot product and cosine similarity) have limitations - dot product is unbounded and norm-sensitive, while cosine similarity discards magnitude information entirely. The paper aims to develop better metrics that integrate both magnitude and alignment information.
Method: Proposed two parameter-free, magnitude-aware similarity functions: Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS). Conducted a comprehensive evaluation using four state-of-the-art sentence embedding models across eight NLP benchmarks, with statistical significance testing using the Wilcoxon signed-rank test (a toy comparison follows the abstract).
Result: Both OS and HTS provided statistically significant improvement in Mean Squared Error over dot product and cosine similarity on tasks requiring holistic semantic understanding (paraphrase and inference). No significant improvement was observed on benchmarks testing nuanced compositional semantics (SICK, STS-B).
Conclusion: Magnitude-aware metrics offer superior performance for holistic semantic understanding tasks, while compositional semantics remains a distinct challenge requiring future research. The findings delineate specific domains where these new metrics are advantageous.
Abstract: Vector comparison in high dimensions is a fundamental task in NLP, yet it is dominated by two baselines: the raw dot product, which is unbounded and sensitive to vector norms, and the cosine similarity, which discards magnitude information entirely. This paper challenges both standards by proposing and rigorously evaluating a new class of parameter-free, magnitude-aware similarity metrics. I introduce two such functions, Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS), designed to integrate vector magnitude and alignment in a more principled manner. To ensure that my findings are robust and generalizable, I conducted a comprehensive evaluation using four state-of-the-art sentence embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, paraphrase-mpnet-base-v2, and BAAI/bge-large-en-v1.5) across a diverse suite of eight standard NLP benchmarks, including STS-B, SICK, Quora, and PAWS. Using the Wilcoxon signed-rank test for statistical significance, my results are definitive: on the tasks requiring holistic semantic understanding (paraphrase and inference), both OS and HTS provide a statistically significant improvement in Mean Squared Error over both the raw dot product and cosine similarity, regardless of the underlying embedding model. Crucially, my findings delineate the specific domain of advantage for these metrics: for tasks requiring holistic semantic understanding like paraphrase and inference, my magnitude-aware metrics offer a statistically superior alternative. This significant improvement was not observed on benchmarks designed to test highly nuanced compositional semantics (SICK, STS-B), identifying the challenge of representing compositional text as a distinct and important direction for future work.
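The exact definitions of OS and HTS are not reproduced in this summary, so the sketch below only contrasts the two criticized baselines with one hypothetical tanh-based, magnitude-aware score; it illustrates the design goal (bounded, yet norm-sensitive) and is not the paper's actual metric.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def hts_like(a, b):
    """Hypothetical bounded, magnitude-aware similarity. NOT the paper's
    HTS definition, which is not given in this summary."""
    return np.tanh(np.linalg.norm(a) * np.linalg.norm(b)) * cosine(a, b)

a = np.array([1.0, 2.0, 3.0])
b = 0.1 * a                 # same direction, much smaller magnitude
print(cosine(a, b))         # 1.0: cosine ignores the magnitude gap
print(a @ b)                # 1.4: dot product is unbounded and norm-sensitive
print(hts_like(a, b))       # ~0.89: bounded, reflects alignment and magnitude
```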
[5] How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs
Jian Ouyang, Arman T, Ge Jin
Main category: cs.CL
TL;DR: Fine-tuning LLMs on incorrect data causes emergent misalignment, with performance degradation starting at just 10-25% incorrect data. Models need at least 50% correct data to recover, but rarely match the base model's safety.
Details
Motivation: LLMs are increasingly used in high-stakes domains (finance, coding, law, health), but fine-tuning on incorrect data can produce harmful outputs unrelated to intended tasks.
Method: Evaluated gpt-4o models fine-tuned with varying ratios (10% to 90% correct) of obviously and subtly incorrect data across four domains: coding, finance, health, and legal (a mixture-construction sketch follows the abstract).
Result: Even modest amounts of incorrect data (10-25%) dramatically degrade domain performance. At least 50% correct data is needed for recovery, but models rarely match the robustness and safety of the base model (which shows near-perfect alignment).
Conclusion: The cost of incorrect data is heavy, emphasizing critical need for extremely high-quality data curation or using robust base models without unnecessary fine-tuning for high-stakes applications.
Abstract: This paper investigates the impact of incorrect data on the performance and safety of large language models (LLMs), specifically gpt-4o, during supervised fine-tuning (SFT). Although LLMs become increasingly vital across broad domains like finance, coding, law, and health, fine-tuning on incorrect data can lead to “emergent misalignment,” producing harmful or deceptive outputs unrelated to the intended task. We evaluate gpt-4o models fine-tuned with varying ratios (10% to 90% correct) of both obviously and subtly incorrect data across four domains: coding, finance, health, and legal. Our findings show that even modest amounts of incorrect data (10-25%) dramatically degrade domain performance, though not moral alignment. A clear threshold of at least 50% correct data is needed for models to consistently recover strong performance, though they rarely match the robustness and safety of the base model, which exhibits near-perfect alignment and zero dangerous completions out-of-the-box. This research emphasizes that the cost of incorrect data is heavy, highlighting the critical need for extremely high-quality data curation or, alternatively, leveraging robust base models without unnecessary fine-tuning for high-stakes applications.
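A minimal sketch of how the fine-tuning mixtures swept in the study (10% to 90% correct) might be constructed; the pools and example format are placeholders.

```python
import random

def build_mixture(correct: list[dict], incorrect: list[dict],
                  pct_correct: float, n: int, seed: int = 0) -> list[dict]:
    """Sample an SFT dataset with a fixed fraction of correct examples."""
    rng = random.Random(seed)
    k = round(n * pct_correct)
    mix = rng.sample(correct, k) + rng.sample(incorrect, n - k)
    rng.shuffle(mix)
    return mix

# Hypothetical pools of curated correct / deliberately incorrect examples.
correct_pool = [{"messages": f"good-{i}"} for i in range(1000)]
incorrect_pool = [{"messages": f"bad-{i}"} for i in range(1000)]
for pct in (0.10, 0.25, 0.50, 0.75, 0.90):
    data = build_mixture(correct_pool, incorrect_pool, pct, n=500)
    print(pct, len(data))
```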
[6] Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers
Ruochi Li, Haoxuan Zhang, Edward Gehringer, Ting Xiao, Junhua Ding, Haihua Chen
Main category: cs.CL
TL;DR: LLMs can generate structured reviews but struggle with critical reasoning and quality sensitivity. GPT-4o produces more descriptive content but significantly underperforms in identifying weaknesses and adjusting feedback based on paper quality.
Details
Motivation: The increasing volume of scientific submissions strains traditional peer review, prompting exploration of LLMs for automated review generation, but their limitations in critical reasoning need systematic evaluation.
Method: Proposed a comprehensive evaluation framework using semantic similarity analysis and structured knowledge graph metrics, tested on 1,683 papers and 6,495 expert reviews from ICLR/NeurIPS using five LLMs.
Result: LLMs perform well in descriptive content (GPT-4o generated 15.74% more entities in strengths) but consistently underperform in weaknesses (GPT-4o produced 59.42% fewer entities) and quality sensitivity (only 5.7% increase in node count for weak papers vs 50% in human reviews).
Conclusion: LLMs have merits for descriptive reviewing but significant defects in critical analysis, providing empirical foundations for developing better LLM-assisted reviewing tools.
Abstract: The surge in scientific submissions has placed increasing strain on the traditional peer-review process, prompting the exploration of large language models (LLMs) for automated review generation. While LLMs demonstrate competence in producing structured and coherent feedback, their capacity for critical reasoning, contextual grounding, and quality sensitivity remains limited. To systematically evaluate these aspects, we propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics to assess LLM-generated reviews against human-written counterparts. We construct a large-scale benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS in multiple years, and generate reviews using five LLMs. Our findings show that LLMs perform well in descriptive and affirmational content, capturing the main contributions and methodologies of the original work, with GPT-4o highlighted as an illustrative example, generating 15.74% more entities than human reviewers in the strengths section of good papers in ICLR 2025. However, they consistently underperform in identifying weaknesses, raising substantive questions, and adjusting feedback based on paper quality. GPT-4o produces 59.42% fewer entities than real reviewers in the weaknesses and increases node count by only 5.7% from good to weak papers, compared to 50% in human reviews. Similar trends are observed across all conferences, years, and models, providing empirical foundations for understanding the merits and defects of LLM-generated reviews and informing the development of future LLM-assisted reviewing tools. Data, code, and more detailed results are publicly available at https://github.com/RichardLRC/Peer-Review.
[7] A systematic review of trial-matching pipelines using large language models
Braxton A. Morrison, Madhumita Sushil, Jacob S. Young
Main category: cs.CL
TL;DR: Systematic review of LLM-based clinical trial matching approaches (2020-2025) showing GPT-4 consistently outperforms other models, with promising strategies including zero-shot prompting and fine-tuning smaller models for privacy.
Details
Motivation: Manual clinical trial matching is labor-intensive and error-prone, leading to recruitment delays. LLM-based pipelines offer a promising automated solution.
Method: Systematic review of 31 studies from 126 unique articles across three academic databases and one preprint server, analyzing LLM approaches to clinical trial matching including patient-to-criterion, patient-to-trial, and eligibility classification tasks.
Result: GPT-4 consistently outperformed other models in matching and eligibility extraction, though at higher cost. Promising strategies included zero-shot prompting with proprietary LLMs and fine-tuning smaller models for data privacy. Key challenges include data access, cost reduction, and mitigating hallucinations/bias.
Conclusion: Standardized metrics, realistic test sets, and attention to cost-efficiency and fairness are critical for broader deployment of LLM-based clinical trial matching systems.
Abstract: Matching patients to clinical trial options is critical for identifying novel treatments, especially in oncology. However, manual matching is labor-intensive and error-prone, leading to recruitment delays. Pipelines incorporating large language models (LLMs) offer a promising solution. We conducted a systematic review of studies published between 2020 and 2025 from three academic databases and one preprint server, identifying LLM-based approaches to clinical trial matching. Of 126 unique articles, 31 met inclusion criteria. Reviewed studies focused on matching patient-to-criterion only (n=4), patient-to-trial only (n=10), trial-to-patient only (n=2), binary eligibility classification only (n=1) or combined tasks (n=14). Sixteen used synthetic data; fourteen used real patient data; one used both. Variability in datasets and evaluation metrics limited cross-study comparability. In studies with direct comparisons, the GPT-4 model consistently outperformed other models, even fine-tuned ones, in matching and eligibility extraction, albeit at higher cost. Promising strategies included zero-shot prompting with proprietary LLMs like the GPT-4o model, advanced retrieval methods, and fine-tuning smaller, open-source models for data privacy when incorporation of large models into hospital infrastructure is infeasible. Key challenges include accessing sufficiently large real-world data sets, and deployment-associated challenges such as reducing cost, mitigating risk of hallucinations, data leakage, and bias. This review synthesizes progress in applying LLMs to clinical trial matching, highlighting promising directions and key limitations. Standardized metrics, more realistic test sets, and attention to cost-efficiency and fairness will be critical for broader deployment.
[8] How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Julie Jung, Max Lu, Sina Chole Benker, Dogus Darici
Main category: cs.CL
TL;DR: Model size is the key factor affecting alignment between LLMs and human assessments of clinical reasoning skills, highlighting the importance of checking alignment across multiple levels.
Details
Motivation: To understand how model size, temperature, and prompt style affect LLMs' alignment with themselves, between different models, and with human assessments in evaluating clinical reasoning skills.
Method: Examined the effects of model size, temperature, and prompt style on LLMs’ alignment across different levels (self-alignment, inter-model alignment, and human-alignment) in clinical reasoning assessment.
Result: Model size emerged as the most significant factor influencing LLM-human score alignment in clinical reasoning evaluation.
Conclusion: The study highlights the critical importance of checking alignments across multiple levels when using LLMs for clinical reasoning assessment, with model size being a key consideration.
Abstract: We examined how model size, temperature, and prompt style affect Large Language Models’ (LLMs) alignment within themselves, between models, and with humans in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignments across multiple levels.
[9] Quantifying Compositionality of Classic and State-of-the-Art Embeddings
Zhijin Guo, Chenhao Xue, Zhaozhen Xu, Hongbo Bo, Yuxuan Ye, Janet B. Pierrehumbert, Martha Lewis
Main category: cs.CL
TL;DR: This paper proposes a two-step evaluation framework to quantify additive compositionality in language models, tracking how well models maintain linear compositional relationships across different layers and training stages.
Details
Motivation: Current SOTA models (transformers and graph models) lack proper limits on meaning shifts due to context, while older models like Word2vec made excessive claims about compositionality. The authors aim to develop a rigorous method to measure how well models exploit compositional meanings.
Method: A two-step evaluation: (1) measure linearity between entity attributes and embeddings using canonical correlation analysis, (2) evaluate additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics (L2 loss, cosine similarity, retrieval accuracy); a toy sketch follows the abstract.
Result: Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of transformer-based models before declining at the top layer. The method successfully captures failure cases where linear composition breaks down.
Conclusion: The proposed framework provides a systematic way to quantify compositionality in language models, revealing important patterns about how compositional understanding develops during training and across model layers.
Abstract: For language models to generalize correctly to novel expressions, it is critical that they exploit access to compositional meanings when this is justified. Even if we don’t know what a “pelp” is, we can use our knowledge of numbers to understand that “ten pelps” makes more pelps than “two pelps”. Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. The SOTA generative, transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. Sentence, knowledge-graph, and word embeddings are evaluated, and compositionality is tracked across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at https://github.com/Zhijin-Guo1/quantifying-compositionality.
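A toy version of the two-step evaluation on synthetic, near-linear data: step (i) runs CCA between one-hot attributes and embeddings; step (ii) fits a linear map on seen attribute combinations and reconstructs embeddings for a held-out combination, reporting L2 and cosine. Attribute counts, dimensions, and noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 500, 64
# Two hypothetical attributes (5 and 4 values), one-hot encoded per entity.
attrs = np.hstack([np.eye(5)[rng.integers(0, 5, n)],
                   np.eye(4)[rng.integers(0, 4, n)]])
W = rng.normal(size=(9, d))
embeds = attrs @ W + 0.1 * rng.normal(size=(n, d))   # near-linear toy embeddings

# Step (i): linearity between attributes and embeddings via CCA.
cca = CCA(n_components=4).fit(attrs, embeds)
U, V = cca.transform(attrs, embeds)
print("canonical corrs:", [round(np.corrcoef(U[:, i], V[:, i])[0, 1], 3)
                           for i in range(4)])

# Step (ii): additive generalization on a held-out attribute combination.
held_out = (attrs[:, 0] == 1) & (attrs[:, 5] == 1)
lin = LinearRegression().fit(attrs[~held_out], embeds[~held_out])
recon = lin.predict(attrs[held_out])
true = embeds[held_out]
l2 = np.linalg.norm(recon - true, axis=1).mean()
cos = np.mean(np.sum(recon * true, axis=1) /
              (np.linalg.norm(recon, axis=1) * np.linalg.norm(true, axis=1)))
print(f"held-out combo: mean L2 = {l2:.3f}, mean cosine = {cos:.3f}")
```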
[10] Pluralistic Off-policy Evaluation and Alignment
Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang, Tong Yu, Subrata Mitra, Julian McAuley, Lina Yao
Main category: cs.CL
TL;DR: POPE is the first framework for offline pluralistic preference evaluation and alignment in LLMs, combining collaborative utility and diversity components to capture diverse human preferences.
Details
Motivation: Existing preference alignment datasets and off-policy estimators focus solely on overall utility while ignoring preference pluralism, which is essential for personalized alignment with diverse human preferences.
Method: POPE includes a unified reward function combining collaborative utility from human preference signals and diversity inspired by entropy-based coverage measures, with decomposable inverse propensity scoring (IPS) estimators for evaluation (an IPS sketch follows the abstract).
Result: Empirical results show that POPE efficiently enhances pluralistic response generation while maintaining models’ general capabilities on downstream tasks.
Conclusion: POPE successfully addresses the open problem of extending Off-Policy Evaluation to pluralistic preference alignment, providing a theoretically grounded framework for evaluating and optimizing LLMs for diverse human preferences.
Abstract: Personalized preference alignment for LLMs with diverse human preferences requires evaluation and alignment methods that capture pluralism. Most existing preference alignment datasets are logged under policies that differ substantially from the evaluated LLMs, and existing off-policy estimators focus solely on overall utility while ignoring preference pluralism. Extending Off-Policy Evaluation (OPE) to pluralistic preference alignment, therefore, remains an open question. Thus, we propose the Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines (1) a collaborative utility component derived from human preference signals (e.g., upvotes or relevance scores) and (2) a diversity component inspired by entropy-based coverage measures, together reflecting pluralistic alignment. Furthermore, to estimate this reward from logged interactions, we derive decomposable inverse propensity scoring (IPS) estimators that separately evaluate relevance and diversity. Theoretically, we prove that our decomposed IPS estimators establish a lower bound on their variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance pluralistic alignment. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models' general capabilities on downstream tasks.
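A minimal numpy sketch of the IPS machinery underlying POPE: one importance-weighted estimator per reward component, combined with a trade-off weight. The propensities, reward signals, and the weight `lam` are simulated assumptions, not the paper's full estimator.

```python
import numpy as np

def ips_estimate(rewards, pi_e, pi_0):
    """IPS estimate of a target policy's value from logged interactions:
    mean of rewards weighted by pi_target / pi_logging."""
    return np.mean((pi_e / pi_0) * rewards)

rng = np.random.default_rng(0)
n = 10_000
pi_0 = rng.uniform(0.05, 0.5, n)        # logging-policy propensities
pi_e = rng.uniform(0.05, 0.5, n)        # target-policy probabilities
relevance = rng.binomial(1, 0.6, n)     # e.g., upvote-derived utility signal
diversity = rng.uniform(0, 1, n)        # e.g., entropy-based coverage term

# Decomposed estimation (sketch): separate IPS per component, then combine.
lam = 0.3                               # assumed relevance/diversity trade-off
value = ips_estimate(relevance, pi_e, pi_0) + lam * ips_estimate(diversity, pi_e, pi_0)
print(f"pluralistic value estimate ~ {value:.3f}")
```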
[11] PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs
Pei Zhang, Andong Chen, Xi Chen, Baosong Yang, Derek F. Wong, Fei Huang
Main category: cs.CL
TL;DR: PART is a multi-stage framework for multilingual speech-text alignment that separates within-language and cross-language training, outperforming conventional methods by dynamically activating LLM parameters and introducing text-based tasks.
Details
Motivation: Existing methods freeze LLM parameters and force cross-language convergence, which limits performance in multilingual speech-text alignment settings.
Method: Progressive Alignment Representation Training (PART) - a multi-stage, multi-task framework that separates within-language from cross-language alignment, dynamically activates LLM parameters during cross-language training, and introduces text-based tasks to enhance multilingual understanding.
Result: Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show PART surpasses conventional approaches, with analysis confirming its ability to balance language-specific distinctions and cross-language generalization.
Conclusion: PART demonstrates effectiveness and generality for multilingual speech modality alignment by better balancing language-specific features and cross-language generalization.
Abstract: Large language models (LLMs) have expanded from text to speech, giving rise to Speech Large Models (SLMs) that support recognition, translation, and synthesis. A key challenge is aligning speech and text representations, which becomes harder in multilingual settings. Existing methods often freeze LLM parameters and train encoders on multilingual data, but this forces cross-language convergence and limits performance. We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment. During cross-language training, LLM parameters are dynamically activated, and text-based tasks are later introduced to enhance multilingual understanding. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches, with analysis confirming its ability to balance language-specific distinctions and cross-language generalization. These results demonstrate PART’s effectiveness and generality for multilingual speech modality alignment.
[12] Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation
Qingsong Wang, Tao Wu, Wang Lin, Yueying Feng, Gongsheng Yuan, Chang Yao, Jingyuan Chen
Main category: cs.CL
TL;DR: CLAF framework addresses cognitive misalignment in LLMs by aligning knowledge complexity and presentation style with user cognition using capability-aware retrieval and style optimization.
Details
Motivation: LLMs struggle to adapt content to users with different cognitive capacities, leading to knowledge-level and presentation-style misalignment that hinders effective comprehension.
Method: Proposes CLAF framework with: capability-aware retrieval based on hierarchical knowledge graph, style optimization guided by Bloom’s taxonomy and preference learning, and knowledge-controllable generation for consistency. Uses SCALE dataset for training/evaluation.
Result: Empirical results show CLAF enhances adaptability and informativeness of LLM outputs across various user profiles.
Conclusion: CLAF offers a robust solution to cognitive-level alignment in real-world applications, improving LLM performance for users with diverse cognitive capacities.
Abstract: Large Language Models (LLMs) have demonstrated strong performance in open-ended generation tasks. However, they often struggle to adapt content to users with differing cognitive capacities, leading to a phenomenon we term cognitive misalignment. This issue arises in two forms: knowledge-level misalignment, where content is too complex or too simplistic relative to user understanding, and presentation-style misalignment, where the structure or tone hinders effective comprehension. To address these challenges, we propose the Cognitive-Level Alignment Framework (CLAF), a general-purpose generation framework that aligns both knowledge complexity and presentation style with user cognition. CLAF integrates a capability-aware retrieval module based on a hierarchical knowledge graph and a style optimization module guided by Bloom’s taxonomy and preference learning. Additionally, a knowledge-controllable generation component ensures consistency and relevance throughout the output. To support training and evaluation, we construct SCALE, a cognitively annotated dataset containing responses at multiple comprehension levels per query. Empirical results show that CLAF enhances the adaptability and informativeness of LLM outputs across a range of user profiles, offering a robust solution to cognitive-level alignment in real-world applications.
[13] Part-of-speech tagging for Nagamese Language using CRF
Alovi N Shohe, Chonglio Khiamungam, Teisovi Angami
Main category: cs.CL
TL;DR: First POS tagging study for Nagamese language using CRF machine learning, achieving 85.70% accuracy on a 16,112-token annotated corpus.
Details
Motivation: Nagamese is an under-resourced Assamese-lexified Creole language with no existing POS tagging work, unlike resource-rich languages like English and Hindi.
Method: Created an annotated corpus of 16,112 tokens and applied the Conditional Random Fields (CRF) machine learning technique for POS tagging (a CRF sketch follows the abstract).
Result: Achieved overall tagging accuracy of 85.70% with precision of 86%, recall of 86%, and f1-score of 85%.
Conclusion: Successfully demonstrated the first POS tagging system for Nagamese language, providing a foundation for future NLP work on this under-resourced language.
Abstract: This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified Creole language developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work in part-of-speech tagging has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese language. The aim of this work is to identify the part-of-speech for a given sentence in the Nagamese language. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied. Using CRF, an overall tagging accuracy of 85.70% is achieved, with precision and recall of 86% and an F1-score of 85%. Keywords: Nagamese, NLP, part-of-speech, machine learning, CRF.
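A minimal sklearn-crfsuite sketch of the approach: hand-crafted per-token features fed to a CRF sequence tagger. The two toy sentences, tags, and feature set are invented placeholders, not drawn from the paper's 16,112-token corpus.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def word_features(sent: list[str], i: int) -> dict:
    """Simple per-token features; real taggers add affixes, context windows, etc."""
    w = sent[i]
    return {
        "word.lower": w.lower(),
        "suffix3": w[-3:],
        "is_first": i == 0,
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
    }

# Hypothetical toy corpus (invented tokens and tags, for shape only).
sents = [["moi", "school", "te", "jai"], ["tai", "kitab", "porhe"]]
tags = [["PRON", "NOUN", "ADP", "VERB"], ["PRON", "NOUN", "VERB"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, tags)
print(crf.predict(X)[0])
```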
[14] Performance of Large Language Models in Answering Critical Care Medicine Questions
Mahmoud Alwakeel, Aditya Nagori, An-Kwok Ian Wong, Neal Chaisson, Vijay Krishnamoorthy, Rishikesan Kamaleswaran
Main category: cs.CL
TL;DR: Evaluation of Meta-Llama 3.1 models (8B and 70B) on Critical Care Medicine questions shows 70B model outperforms 8B by 30% with 60% average accuracy, with performance varying significantly across medical subspecialties.
Details
Motivation: To assess LLM performance in specialized medical fields like Critical Care Medicine, which is less explored compared to general medical student-level questions.
Method: Tested Meta-Llama 3.1 models (8B and 70B parameters) on 871 Critical Care Medicine questions across different medical domains.
Result: Llama3.1:70B achieved 60% average accuracy, outperforming 8B model by 30%. Performance varied by domain: highest in Research (68.4%) and lowest in Renal (47.9%).
Conclusion: There is a need for broader future work to improve LLM performance across various medical subspecialty domains, as current models show significant variability in different areas of Critical Care Medicine.
Abstract: Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.
[15] Retrieval Augmented Generation based context discovery for ASR
Dimitrios Siskos, Stavros Papadopoulos, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Anastasios Drosou
Main category: cs.CL
TL;DR: This paper proposes an embedding-based retrieval approach for automatic context discovery in ASR systems to improve transcription accuracy for rare terms, comparing it with LLM-based alternatives.
Details
Motivation: To address the challenge of automatically identifying the right context for improving ASR transcription accuracy, particularly for rare or out-of-vocabulary terms.
Method: Proposes an efficient embedding-based retrieval approach for automatic context discovery, and compares it with two LLM-based alternatives: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs (a retrieval sketch follows the abstract).
Result: The proposed approach reduces WER by up to 17% relative to the no-context baseline, while oracle context achieves up to a 24.1% reduction on the TED-LIUMv3, Earnings21, and SPGISpeech datasets.
Conclusion: The embedding-based retrieval approach is an effective strategy for automatic context discovery in ASR systems, significantly improving transcription accuracy for rare terms.
Abstract: This work investigates retrieval augmented generation as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, in order to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. However, identifying the right context automatically remains an open challenge. This work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21, and SPGISpeech datasets demonstrate that the proposed approach reduces WER by up to 17% (percentage difference) relative to using no context, while the oracle context results in a reduction of up to 24.1%.
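A sketch of the embedding-based context discovery idea, assuming a sentence-embedding model and a first-pass transcript as the query. The candidate pool, model choice, and top-k are illustrative; the paper's pipeline details may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical pool of candidate context entries (names, tickers, jargon).
candidates = ["Satya Nadella", "EBITDA margin", "Kubernetes",
              "TED-LIUM", "Q3 guidance"]
cand_emb = model.encode(candidates, normalize_embeddings=True)

def discover_context(first_pass: str, top_k: int = 2) -> list[str]:
    """Retrieve the candidates most similar to a first-pass hypothesis;
    these are then supplied to the ASR system as biasing context."""
    q = model.encode([first_pass], normalize_embeddings=True)
    scores = (q @ cand_emb.T)[0]       # cosine similarity on unit vectors
    return [candidates[i] for i in np.argsort(-scores)[:top_k]]

print(discover_context("the company raised its q three guidance on ebitda"))
```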
[16] SCORE: A Semantic Evaluation Framework for Generative Document Parsing
Renyu Li, Antonio Jimeno Yepes, Yao You, Kamil Pluciński, Maximilian Operlejn, Crag Wolfe
Main category: cs.CL
TL;DR: SCORE is a new evaluation framework for multi-modal generative document parsing that addresses limitations of traditional metrics by being interpretation-agnostic and handling structural diversity while maintaining semantic rigor.
Details
Motivation: Traditional evaluation metrics (CER, WER, IoU, TEDS) misclassify semantically correct but structurally divergent outputs from generative document parsing systems as errors, penalizing valid interpretations and obscuring true system performance.
Method: SCORE integrates four components: (1) adjusted edit distance for content fidelity, (2) token-level diagnostics to distinguish hallucinations from omissions, (3) table evaluation with spatial tolerance and semantic alignment, and (4) hierarchy-aware consistency checks. It normalizes generative outputs into format-agnostic representations (component (1) is sketched after the abstract).
Result: Across 1,114 pages, SCORE revealed cross-dataset performance patterns missed by standard metrics. In ambiguous table structures (2-5% of pages), traditional metrics penalized systems by 12-25%, but SCORE corrected these cases. SCORE also reproduced traditional scores (table F1 up to 0.93) without object-detection pipelines.
Conclusion: SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems by exposing how interpretive diversity impacts evaluation and providing multi-dimensional, interpretable diagnostics.
Abstract: Multi-modal generative document parsing systems challenge traditional evaluation: unlike deterministic OCR or layout models, they often produce semantically correct yet structurally divergent outputs. Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE (Structural and COntent Robust Evaluation), an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks. Together, these dimensions enable evaluation that embraces representational diversity while enforcing semantic rigor. Across 1,114 pages spanning a holistic benchmark and a field dataset, SCORE consistently revealed cross-dataset performance patterns missed by standard metrics. In 2-5% of pages with ambiguous table structures, traditional metrics penalized systems by 12-25% on average, leading to distorted rankings. SCORE corrected these cases, recovering equivalence between alternative but valid interpretations. Moreover, by normalizing generative outputs into a format-agnostic representation, SCORE reproduces traditional scores (e.g., table F1 up to 0.93) without requiring object-detection pipelines, demonstrating that generative parsing alone suffices for comprehensive evaluation. By exposing how interpretive diversity impacts evaluation outcomes and providing multi-dimensional, interpretable diagnostics, SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems.
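To make component (1) concrete, here is a toy adjusted-edit-distance check in which a markdown table and a plain rendering of the same content score as equivalent after normalization. The normalization rule is a deliberately crude stand-in for SCORE's format-agnostic representation.

```python
import re

def normalize(text: str) -> list[str]:
    """Crude format-agnostic normalization: lowercase, keep word tokens only,
    so structurally different but semantically equal outputs compare fairly."""
    return re.findall(r"\w+", text.lower())

def edit_distance(a: list[str], b: list[str]) -> int:
    """Token-level Levenshtein distance (single-row dynamic programming)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

ref = "| Name | Age |\n| Ada | 36 |"      # table-style output
hyp = "Name Age\nAda 36"                  # plain-text rendering, same content
a, b = normalize(ref), normalize(hyp)
fidelity = 1 - edit_distance(a, b) / max(len(a), len(b))
print(f"content fidelity ~ {fidelity:.2f}")  # 1.00 despite different layout
```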
[17] Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches
Maryam Mahdi Alhusseini, Mohammad-Reza Feizi-Derakhshi
Main category: cs.CL
TL;DR: A dual-perspective approach combining lexicon-based sentiment analysis with deep learning models (CNN and Bi-LSTM) to analyze user reviews of ChatGPT and DeepSeek on Google Play Store, showing CNN’s superior performance and ChatGPT’s more positive sentiment.
Details
Motivation: To address the gap in prior research that focuses on either lexicon-based strategies or deep learning models in isolation, by conducting an extensive investigation into user satisfaction with LLM-based applications.
Method: Collected 4,000 authentic user reviews, preprocessed and oversampled for balanced classes. Used TextBlob for lexicon-based analysis and CNN/Bi-LSTM deep learning models for classification, testing on a balanced set of 1,700 reviews (the lexicon step is sketched after the abstract).
Result: ChatGPT received significantly more positive sentiment than DeepSeek. CNN outperformed Bi-LSTM with 96.41% accuracy and near-perfect classification of negative reviews, alongside high F1-scores for neutral and positive sentiments.
Conclusion: This research sets a new methodological standard for measuring sentiment in LLM-based applications and provides practical insights for improving user-centric AI system design.
Abstract: This study presents a novel dual-perspective approach to analyzing user reviews for ChatGPT and DeepSeek on the Google Play Store, integrating lexicon-based sentiment analysis (TextBlob) with deep learning classification models, including Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks. Unlike prior research, which focuses on either lexicon-based strategies or predictive deep learning models in isolation, this study conducts an extensive investigation into user satisfaction with Large Language Model (LLM) based applications. A dataset of 4,000 authentic user reviews was collected, which were carefully preprocessed and subjected to oversampling to achieve balanced classes. A balanced test set of 1,700 reviews was used for model testing. Results from the experiments reveal that ChatGPT received significantly more positive sentiment than DeepSeek. Furthermore, deep learning based classification demonstrated superior performance over lexicon analysis, with CNN outperforming Bi-LSTM by achieving 96.41% accuracy and near-perfect classification of negative reviews, alongside high F1-scores for neutral and positive sentiments. This research sets a new methodological standard for measuring sentiment in LLM-based applications and provides practical insights for developers and researchers seeking to improve user-centric AI system design.
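A minimal sketch of the lexicon-based half of the pipeline using TextBlob polarity. The class-boundary thresholds are assumptions; the paper's exact cutoffs are not given in this summary.

```python
from textblob import TextBlob  # pip install textblob

def lexicon_label(review: str) -> str:
    """Map TextBlob polarity in [-1, 1] to a sentiment class.
    The +/-0.05 thresholds are assumed, not the paper's."""
    p = TextBlob(review).sentiment.polarity
    if p > 0.05:
        return "positive"
    if p < -0.05:
        return "negative"
    return "neutral"

for r in ["This app is brilliant and fast!",
          "It crashes constantly, terrible update.",
          "It opens and shows text."]:
    print(lexicon_label(r), "|", r)
```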
[18] Z-Scores: A Metric for Linguistically Assessing Disfluency Removal
Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, James Caverlee
Main category: cs.CL
TL;DR: Z-Scores is a span-level evaluation metric for disfluency removal that categorizes system behavior across distinct disfluency types, providing more detailed diagnostics than traditional word-level metrics.
Details
Motivation: Traditional word-based metrics like precision, recall, and F1 capture overall performance but cannot reveal why models succeed or fail on specific disfluency types, limiting targeted improvements.
Method: Introduces Z-Scores with a deterministic alignment module that enables robust mapping between generated text and disfluent transcripts, categorizing system behavior across disfluency types (EDITED, INTJ, PRN); an alignment sketch follows the abstract.
Result: Z-Scores expose systematic weaknesses that word-level metrics obscure, enabling identification of model failure modes and design of targeted interventions like tailored prompts or data augmentation.
Conclusion: Z-Scores provide category-specific diagnostics that uncover challenges with specific disfluency types (INTJ and PRN) hidden in aggregate F1 scores, directly informing model refinement strategies.
Abstract: Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions – such as tailored prompts or data augmentation – yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
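A rough sketch of the span-level idea: deterministically align the disfluent transcript with the generated text, then score removal per annotated category. The alignment here uses difflib rather than the paper's module, and the example spans are invented.

```python
from difflib import SequenceMatcher

def removal_by_category(disfluent: list[str], generated: list[str],
                        spans: dict[str, list[tuple[int, int]]]) -> dict:
    """Per-category fraction of annotated disfluent tokens the system removed,
    using a deterministic token alignment (sketch)."""
    sm = SequenceMatcher(a=disfluent, b=generated, autojunk=False)
    kept = {i for blk in sm.get_matching_blocks()
            for i in range(blk.a, blk.a + blk.size)}
    return {cat: sum(i not in kept for s, e in cat_spans for i in range(s, e)) /
                 sum(e - s for s, e in cat_spans)
            for cat, cat_spans in spans.items()}

disfluent = "i uh i want um a flight to boston".split()
generated = "i want a flight to boston".split()
# Hypothetical gold spans: token ranges per category (Treebank-style labels).
spans = {"EDITED": [(0, 1)], "INTJ": [(1, 2), (4, 5)]}
print(removal_by_category(disfluent, generated, spans))  # both 1.0 here
```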
[19] Characterizing Knowledge Graph Tasks in LLM Benchmarks Using Cognitive Complexity Frameworks
Sara Todorovikj, Lars-Peter Meyer, Michael Martin
Main category: cs.CL
TL;DR: The paper proposes using cognitive psychology frameworks to characterize task complexity in LLM-KG evaluation, moving beyond just accuracy metrics to enable richer interpretation and diversity in benchmark tasks.
Details
Motivation: Current evaluation of LLMs on Knowledge Graph tasks focuses primarily on accuracy and output correctness, which provides a limited view of model capabilities. The authors aim to complement existing metrics with task complexity analysis from cognitive psychology.Method: The authors apply three complexity frameworks from cognitive psychology to the LLM-KG-Bench framework to analyze task demands and identify underrepresented complexity aspects in current evaluations.
Result: The analysis reveals value distributions in task complexity and highlights currently underrepresented demands, suggesting that current benchmarks may not fully capture the range of cognitive challenges in LLM-KG interactions.
Conclusion: Incorporating cognitive psychology frameworks enables richer interpretation and promotes diversity in benchmark evaluation tasks, providing a more comprehensive assessment of LLM capabilities on Knowledge Graph tasks beyond simple accuracy metrics.
Abstract: Large Language Models (LLMs) are increasingly used for tasks involving Knowledge Graphs (KGs), whose evaluation typically focuses on accuracy and output correctness. We propose a complementary task characterization approach using three complexity frameworks from cognitive psychology. Applying this to the LLM-KG-Bench framework, we highlight value distributions, identify underrepresented demands and motivate richer interpretation and diversity for benchmark evaluation tasks.
[20] DRES: Benchmarking LLMs for Disfluency Removal
Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, James Caverlee
Main category: cs.CL
TL;DR: DRES is a controlled text-level benchmark for evaluating disfluency removal in speech systems, providing reproducible semantic upper bounds and systematic evaluation of LLMs.
Details
Motivation: Disfluencies like 'um', 'uh', and edited statements degrade accuracy in speech-driven systems for command interpretation, summarization, and conversational agents.
Method: DRES builds on human-annotated Switchboard transcripts to isolate disfluency removal from ASR errors. It systematically evaluates proprietary and open-source LLMs across scales, prompting strategies, and architectures (the scoring is sketched after the abstract).
Result: Findings show: (i) simple segmentation improves performance even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; (iii) fine-tuning achieves near state-of-the-art precision but harms generalization. The paper provides LLM-specific error modes and 9 practical recommendations.
Conclusion: DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems by establishing controlled evaluation standards for disfluency removal.
Abstract: Disfluencies – such as “um,” “uh,” interjections, parentheticals, and edited statements – remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
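For contrast with the span-level Z-Scores above, a sketch of the token-level precision/recall that suites like DRES report: gold deletions come from aligning the disfluent transcript against the fluent reference, system deletions from aligning against the model output. The alignment and example are illustrative, not the suite's exact code.

```python
from difflib import SequenceMatcher

def kept(src: list[str], out: list[str]) -> set[int]:
    """Indices of source tokens preserved in the output, via alignment."""
    sm = SequenceMatcher(a=src, b=out, autojunk=False)
    return {i for blk in sm.get_matching_blocks()
            for i in range(blk.a, blk.a + blk.size)}

def deletion_prf(disfluent, gold, system):
    """Precision/recall/F1 of the system's token deletions."""
    should = set(range(len(disfluent))) - kept(disfluent, gold)
    did = set(range(len(disfluent))) - kept(disfluent, system)
    tp = len(should & did)
    p = tp / len(did) if did else 1.0
    r = tp / len(should) if should else 1.0
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

disfluent = "so um we could we should ship it".split()
gold = "we should ship it".split()
system = "we could we should ship it".split()   # under-deletes "we could"
print(deletion_prf(disfluent, gold, system))    # high precision, low recall
```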
[21] ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
Robert Tjarko Lange, Yuki Imajuku, Edoardo Cetin
Main category: cs.CL
TL;DR: ShinkaEvolve is an open-source framework that uses LLMs as mutation operators for scientific discovery, achieving state-of-the-art performance with high sample efficiency through innovative techniques like parent sampling, code novelty rejection-sampling, and bandit-based LLM ensemble selection.
Details
Motivation: Current code evolution methods for scientific discovery are sample inefficient (requiring thousands of samples) and remain closed-source, limiting broad adoption and extension. The authors aim to address these limitations to democratize open-ended discovery.
Method: The framework introduces three key innovations: 1) a parent sampling technique balancing exploration and exploitation, 2) code novelty rejection-sampling for efficient search space exploration, and 3) a bandit-based LLM ensemble selection strategy (the third is sketched after the abstract).
Result: ShinkaEvolve achieves exceptional sample efficiency, discovering a new state-of-the-art circle packing solution using only 150 samples, designing high-performing agentic harnesses for AIME mathematical reasoning tasks, improving ALE-Bench competitive programming solutions, and discovering novel mixture-of-expert load balancing loss functions.
Conclusion: ShinkaEvolve demonstrates broad applicability with exceptional sample efficiency, providing open-source accessibility and cost-efficiency to democratize open-ended discovery across diverse computational problems.
Abstract: We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.
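A sketch of the third innovation, bandit-based ensemble selection, as a plain UCB1 loop where each arm is an LLM used as the mutation operator and the reward is the fitness gain of the mutated program. Model names and the reward signal are placeholders.

```python
import math, random

class UCB1:
    """UCB1 bandit (sketch) for choosing which LLM mutates the next program."""
    def __init__(self, arms: list[str]):
        self.arms, self.n, self.total = arms, [0] * len(arms), [0.0] * len(arms)

    def select(self) -> int:
        for i, cnt in enumerate(self.n):
            if cnt == 0:
                return i                       # try every arm once first
        t = sum(self.n)
        ucb = [self.total[i] / self.n[i] + math.sqrt(2 * math.log(t) / self.n[i])
               for i in range(len(self.arms))]
        return max(range(len(self.arms)), key=ucb.__getitem__)

    def update(self, i: int, reward: float):
        self.n[i] += 1
        self.total[i] += reward

bandit = UCB1(["llm-a", "llm-b", "llm-c"])     # placeholder model names
random.seed(0)
for _ in range(100):
    arm = bandit.select()
    reward = random.gauss([0.2, 0.5, 0.3][arm], 0.1)  # toy fitness improvement
    bandit.update(arm, reward)
print(dict(zip(bandit.arms, bandit.n)))        # pulls concentrate on llm-b
```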
[22] TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities
Jiajun Chen, Yangyang Wu, Xiaoye Miao, Mengying Zhu, Meng Xi
Main category: cs.CL
TL;DR: TriSPrompt is a hierarchical soft prompt model that addresses incomplete multimodal data in rumor detection by using three types of prompts to handle modality-aware information, missing modalities, and mutual views between subjective and objective perspectives.
Details
Motivation: Existing multimodal rumor detection methods fail to handle incomplete modalities commonly found in real-world data, limiting their practical effectiveness.
Method: Proposes TriSPrompt with three prompt types: modality-aware (MA) prompt for heterogeneous/homogeneous feature capture and modality recovery, modality-missing (MM) prompt for modeling missing states, and mutual-views (MV) prompt for learning relationships between subjective (text/image) and objective (comments) perspectives (the soft-prompt mechanism is sketched after the abstract).
Result: Achieves over 13% accuracy gain compared to state-of-the-art methods on three real-world benchmarks.
Conclusion: TriSPrompt effectively addresses the challenge of incomplete multimodal data in rumor detection through hierarchical soft prompting, demonstrating significant performance improvements over existing methods.
Abstract: The widespread presence of incomplete modalities in multimodal data poses a significant challenge to achieving accurate rumor detection. Existing multimodal rumor detection methods primarily focus on learning joint modality representations from complete multimodal training data, rendering them ineffective in addressing the common occurrence of missing modalities in real-world scenarios. In this paper, we propose a hierarchical soft prompt model TriSPrompt, which integrates three types of prompts, i.e., a modality-aware (MA) prompt, a modality-missing (MM) prompt, and a mutual-views (MV) prompt, to effectively detect rumors in incomplete multimodal data. The MA prompt captures both heterogeneous information from specific modalities and homogeneous features from available data, aiding in modality recovery. The MM prompt models missing states in incomplete data, enhancing the model’s adaptability to missing information. The MV prompt learns relationships between subjective (i.e., text and image) and objective (i.e., comments) perspectives, effectively detecting rumors. Extensive experiments on three real-world benchmarks demonstrate that TriSPrompt achieves an accuracy gain of over 13% compared to state-of-the-art methods. The codes and datasets are available at https://anonymous.4open.science/r/code-3E88.
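The shared mechanism behind all three prompt types is a learnable soft prompt prepended to the input embeddings; a minimal PyTorch sketch follows. Prompt length and hidden size are arbitrary choices here, and TriSPrompt's actual architecture is richer than this single module.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to input embeddings (sketch).
    TriSPrompt composes three such prompt types (MA, MM, MV)."""
    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, dim) from a frozen backbone's embedder
        p = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([p, input_embeds], dim=1)

ma_prompt = SoftPrompt(n_tokens=10, dim=768)
x = torch.randn(4, 32, 768)        # toy batch of token embeddings
print(ma_prompt(x).shape)          # torch.Size([4, 42, 768])
```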
[23] RoadMind: Towards a Geospatial AI Expert for Disaster Response
Ahmed El Fekih Zguir, Ferda Ofli, Muhammad Imran
Main category: cs.CL
TL;DR: RoadMind is a self-supervised framework that enhances LLMs’ geospatial reasoning capabilities using OpenStreetMap data, particularly for disaster response scenarios.
Details
Motivation: LLMs lack robust geospatial reasoning abilities needed for critical disaster response tasks like evacuation planning and resource allocation.Method: Automated pipeline extracts road infrastructure data from OpenStreetMap, converts it into multiple supervision formats, and trains LLMs using QLoRA adapters with 4-bit quantization.
Result: RoadMind-trained models significantly outperform state-of-the-art LLMs with advanced prompt engineering on tasks like road segment identification, nearest road retrieval, and distance/direction estimation across three disaster-prone cities.
Conclusion: Structured geospatial data can effectively enhance language models with robust spatial reasoning capabilities for offline AI disaster response systems.
Abstract: Large Language Models (LLMs) have shown impressive performance across a range of natural language tasks, but remain limited in their ability to reason about geospatial data, particularly road networks, distances, and directions. This gap poses challenges in disaster scenarios, where spatial understanding is critical for tasks such as evacuation planning and resource allocation. In this work, we present RoadMind, a self-supervised framework that enhances the geospatial reasoning capabilities of LLMs using structured data from OpenStreetMap (OSM). Our automated pipeline extracts road infrastructure data for a given city and converts it into multiple supervision formats tailored to key spatial tasks. We pretrain and fine-tune LLMs on these representations using QLoRA adapters and 4-bit quantized models. We evaluate our approach on three disaster-prone cities with varying global representation, Los Angeles, Christchurch, and Manila, across tasks such as road segment identification, nearest road retrieval, and distance/direction estimation. Our results show that models trained via RoadMind significantly outperform strong baselines, including state-of-the-art LLMs equipped with advanced prompt engineering. This demonstrates the potential of structured geospatial data to enhance language models with robust spatial reasoning, enabling more effective offline AI systems for disaster response.
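The training recipe described above (QLoRA adapters on a 4-bit quantized base model) maps onto a standard Hugging Face setup. A minimal sketch, assuming a small stand-in checkpoint and default LoRA hyperparameters rather than the authors' actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "facebook/opt-350m"  # stand-in; the paper's base model may differ

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb)
tokenizer = AutoTokenizer.from_pretrained(base)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these small matrices are trained.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Training would then proceed on OSM-derived supervision (e.g., road-segment
# Q&A pairs rendered as plain text) with a standard causal-LM fine-tuning loop.
```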
[24] Benchmarking and Improving LLM Robustness for Personalized Generation
Chimaobi Okite, Naihao Deng, Kiran Bodipati, Huaidian Hou, Joyce Chai, Rada Mihalcea
Main category: cs.CL
TL;DR: PERG framework evaluates LLM robustness in personalization by measuring both factuality and preference alignment. Current models struggle with maintaining correctness when personalizing, with smaller models failing more often. Pref-Aligner method improves robustness by 25% on average.
Details
Motivation: Existing evaluations focus only on preference alignment, overlooking factuality as a critical dimension. The paper argues that robust personalization requires both factual accuracy and user preference alignment.Method: Introduces PERG framework and PERGData dataset to evaluate 14 models from 5 families. Proposes Pref-Aligner, a two-stage approach to improve robustness by better aligning preferences while maintaining factual correctness.
Result: Current LLMs struggle with robust personalization: GPT-4.1 and LLaMA3-70B fail in 5% of previously successful cases without personalization, while 7B-scale models fail over 20% of the time. Robustness is affected by query nature and preference type.
Conclusion: Highlights critical gaps in current evaluation practices and provides tools for more reliable, user-aligned LLM deployments. Pref-Aligner demonstrates significant improvement in robustness across models.
Abstract: Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness in LLMs, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B-scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.
[25] Semantic Representation Attack against Aligned Large Language Models
Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau
Main category: cs.CL
TL;DR: Semantic Representation Attack is a novel adversarial attack method that targets semantic representations rather than exact textual patterns to bypass LLM alignment safeguards, achieving high success rates while maintaining natural prompts.
Details
Motivation: Current adversarial attack methods against aligned LLMs suffer from limited convergence, unnatural prompts, and high computational costs by targeting exact affirmative responses.Method: The approach exploits semantic representation space of diverse harmful responses and uses Semantic Representation Heuristic Search algorithm for efficient prompt generation while maintaining interpretability during incremental expansion.
Result: Achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency.
Conclusion: The Semantic Representation Attack fundamentally reconceptualizes adversarial objectives and demonstrates overall superiority over existing methods, with theoretical guarantees for semantic convergence.
Abstract: Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is…", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.
[26] The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior
Angelina Wang, Daniel E. Ho, Sanmi Koyejo
Main category: cs.CL
TL;DR: Offline evaluations fail to capture real-world language model behavior due to personalization effects in chat sessions.
Details
Motivation: Standard offline evaluations don't reflect how language models actually behave in practice, where personalization significantly alters model responses to identical questions.Method: Conducted field evaluations with 800 real ChatGPT and Gemini users, having them pose benchmark questions through their chat interfaces, comparing results to traditional offline evaluations.
Result: Identical benchmark questions produced markedly different responses when asked in different users’ chat sessions compared to state-less offline evaluations.
Conclusion: Personalization in chat interfaces fundamentally changes language model behavior, demonstrating the limitations of standard offline evaluation methods.
Abstract: Standard offline evaluations for language models – a series of independent, state-less inferences made by models – fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, identical benchmark questions to the same language model can produce markedly different responses when prompted to a state-less system, in one user’s chat session, or in a different user’s chat session. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other provided questions to their chat interfaces.
[27] LLM-Assisted Topic Reduction for BERTopic on Social Media Data
Wannes Janssens, Matthias Bogaert, Dirk Van den Poel
Main category: cs.CL
TL;DR: A framework combining BERTopic for topic generation with LLMs for topic reduction to handle noisy social media data more effectively than standalone approaches.
Details
Motivation: BERTopic struggles with noisy social media data (overlapping topics), while pure LLM approaches are computationally expensive and lack scalability for big data.Method: First generates topics using BERTopic, creates topic representations, then uses LLMs to iteratively identify and merge semantically similar topics for reduction.
Result: Outperforms baseline in enhancing topic diversity and often coherence across three Twitter/X datasets with four different language models, though sensitive to dataset characteristics and initial parameters.
Conclusion: The hybrid approach effectively addresses limitations of both BERTopic and pure LLM methods for social media topic modeling, achieving better performance with manageable computational requirements.
Abstract: The BERTopic framework leverages transformer embeddings and hierarchical clustering to extract latent topics from unstructured text corpora. While effective, it often struggles with social media data, which tends to be noisy and sparse, resulting in an excessive number of overlapping topics. Recent work explored the use of large language models for end-to-end topic modelling. However, these approaches typically require significant computational overhead, limiting their scalability in big data contexts. In this work, we propose a framework that combines BERTopic for topic generation with large language models for topic reduction. The method first generates an initial set of topics and constructs a representation for each. These representations are then provided as input to the language model, which iteratively identifies and merges semantically similar topics. We evaluate the approach across three Twitter/X datasets and four different language models. Our method outperforms the baseline approach in enhancing topic diversity and, in many cases, coherence, with some sensitivity to dataset characteristics and initial parameter selection.
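A minimal sketch of the generate-then-reduce loop: BERTopic produces topics and keyword representations, and an LLM proposes pairs to merge until no candidates remain. Here a crude keyword-overlap heuristic stands in for the LLM call, and the 20-newsgroups corpus stands in for the Twitter/X data; both substitutions are assumptions for illustration:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

def propose_merge(reprs):
    """Placeholder for the LLM call: a keyword-overlap heuristic stands in
    for asking a model which two topics are semantically similar."""
    ids = list(reprs)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if len(set(reprs[a]) & set(reprs[b])) >= len(reprs[a]) // 2:
                return (a, b)
    return None

def keyword_reprs(model):
    """Top keywords per topic, skipping BERTopic's -1 outlier topic."""
    return {t: [w for w, _ in model.get_topic(t)]
            for t in set(model.topics_) if t != -1}

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data[:1000]
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

# Merge one proposed pair per round; BERTopic reassigns topic ids after each
# merge, so the representations are rebuilt before the next proposal.
reprs = keyword_reprs(topic_model)
while (pair := propose_merge(reprs)) is not None:
    topic_model.merge_topics(docs, list(pair))
    reprs = keyword_reprs(topic_model)
print(f"{len(reprs)} topics after reduction")
```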
[28] Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
Main category: cs.CL
TL;DR: PPSD is a pipeline-parallel self-speculative decoding method that overlaps draft and verification computations to achieve optimal acceleration in LLM inference, achieving 2.01x~3.81x speedup.
Details
Motivation: Early-exit based self-speculative decoding (EESD) often fails to achieve expected acceleration because draft costs can overcome gains when many draft tokens are rejected, leading to negative speedup.Method: PPSD pipelines draft and verification work by configuring model layers as a pipeline where early-exit (draft) and remaining-layer (verification) computations overlap, and interleaving drafting and verification per token using a verify-while-draft scheme.
Result: Empirical results show PPSD achieves state-of-the-art acceleration with speedup ratios of 2.01x~3.81x across diverse benchmarks, achieving near-optimal acceleration at fixed acceptance rates and exit positions.
Conclusion: PPSD provides efficient self-speculation by keeping all units busy and validating tokens on-the-fly, showcasing advancement in self-speculative LLM inference.
Abstract: Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. However, in practice, many approaches struggle to achieve the expected acceleration in such a draft-then-verify paradigm even with a well-aligned early-exit head and selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM. Otherwise, the draft cost may outweigh the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. We configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. We interleave drafting and verification per token. While the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. Such a verify-while-draft scheme keeps all units busy and validates tokens on-the-fly, analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of 2.01x~3.81x, approaching the optimal acceleration for a fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.
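The verify-while-draft schedule can be pictured with a toy clock simulation: on every tick the late (verification) stage finalizes token t while the early-exit stage drafts token t+1, so a rejected draft costs only its own pipeline slot and verification never stalls. This is a schematic illustration of the schedule only, not the actual pipelined implementation:

```python
import random

def ppsd_schedule(num_ticks=6, accept_rate=0.7):
    """Simulate the verify-while-draft overlap: both stages work every tick,
    so a rejected draft wastes only the draft slot, never a verification slot."""
    for t in range(num_ticks):
        accepted = random.random() < accept_rate
        status = "accepted" if accepted else "rejected -> redraft from verified token"
        print(f"tick {t}: late stage verifies token {t} | "
              f"early stage drafts token {t + 1} ({status})")

random.seed(0)
ppsd_schedule()
```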
[29] SLM-Based Agentic AI with P-C-G: Optimized for Korean Tool Use
Changhyun Jeon, Jinhee Park, Jungwoo Choi, Keonwoo Kim, Jisu Kim, Minji Hong
Main category: cs.CL
TL;DR: P-C-G is a Korean-optimized SLM agent architecture with specialized Planner, Caller, and Generator roles that reduces Korean-English code switching issues for improved tool use.
Details
Motivation: To address execution failures caused by frequent Korean-to-English code switching in Korean tool-use settings by developing a role-specialized architecture optimized for Korean language processing.Method: Three-role architecture: Planner creates initial batch plans with limited replanning, Caller validates schema-values and returns normalized call objects, Generator integrates tool outputs. Uses Korean-first value policy to minimize code switching.
Result: Competitive tool-use accuracy and end-to-end quality with reduced token usage and acceptable latency across various Korean query scenarios including single-chain, multi-chain, and missing parameter/function cases.
Conclusion: Role-specialized small language models provide a cost-effective alternative for Korean tool-use agents, demonstrating that targeted architectural design can overcome language-specific challenges.
Abstract: We propose a small-scale language model (SLM) based agent architecture, Planner-Caller-Generator (P-C-G), optimized for Korean tool use. P-C-G separates planning, calling, and generation by role: the Planner produces an initial batch plan with limited on-demand replanning; the Caller returns a normalized call object after joint schema-value validation; and the Generator integrates tool outputs to produce the final answer. We apply a Korean-first value policy to reduce execution failures caused by frequent Korean-to-English code switching in Korean settings. Evaluation assumes Korean queries and Korean tool/parameter specifications; it covers single-chain, multi-chain, missing-parameters, and missing-functions scenarios, and is conducted via an LLM-as-a-Judge protocol averaged over five runs under a unified I/O interface. Results show that P-C-G delivers competitive tool-use accuracy and end-to-end quality while reducing tokens and maintaining acceptable latency, indicating that role-specialized SLMs are a cost-effective alternative for Korean tool-use agents.
[30] GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models
Yue Zhang, Jiaxin Zhang, Qiuyu Ren, Tahsin Saffat, Xiaoxuan Liu, Zitong Yang, Banghua Zhu, Yi Ma
Main category: cs.CL
TL;DR: GAUSS is a benchmark that evaluates LLMs’ mathematical abilities across 12 core skill dimensions grouped into three domains, providing fine-grained and interpretable profiles of models’ mathematical intelligence.
Details
Motivation: To move beyond simple accuracy metrics and provide comprehensive, skill-based evaluation of LLMs' mathematical abilities by isolating specific cognitive skills.Method: Categorizing mathematical problems according to cognitive skills and designing tasks that isolate specific abilities across three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity.
Result: The benchmark creates comprehensive skill profiles that faithfully represent models’ underlying mathematical intelligence, as demonstrated by profiling GPT-5-thinking and comparing it with o4-mini-high.
Conclusion: GAUSS enables multidimensional, skill-based evaluation that reveals specific strengths, weaknesses, and differences between models, providing more interpretable insights than traditional benchmarks.
Abstract: We introduce GAUSS (General Assessment of Underlying Structured Skills in Mathematics), a benchmark that evaluates LLMs' mathematical abilities across twelve core skill dimensions, grouped into three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity. By categorizing problems according to cognitive skills and designing tasks that isolate specific abilities, GAUSS constructs comprehensive, fine-grained, and interpretable profiles of models' mathematical abilities. These profiles faithfully represent their underlying mathematical intelligence. To exemplify how to use the GAUSS benchmark, we have derived the skill profile of GPT-5-thinking, revealing its strengths and weaknesses as well as its differences relative to o4-mini-high, thereby underscoring the value of multidimensional, skill-based evaluation.
[31] Meow: End-to-End Outline Writing for Automatic Academic Survey
Zhaoyu Ma, Yuan Shan, Jiahao Zhao, Nan Xu, Lei Wang
Main category: cs.CL
TL;DR: Meow is a metadata-driven framework for automated survey outline generation that produces hierarchical structured outlines from paper metadata using a two-stage training approach.
Details
Motivation: Existing automatic survey methods treat outline writing as mere workflow steps, producing template-based outlines that lack in-depth understanding and fine-grained styles.Method: Formulate outline writing as end-to-end task, curate high-quality dataset from arXiv/bioRxiv/medRxiv, establish systematic evaluation metrics, and employ two-stage training combining supervised fine-tuning and reinforcement learning.
Result: The 8B reasoning model demonstrates strong performance with high structural fidelity and stylistic coherence.
Conclusion: Meow provides an efficient framework for producing organized and faithful outlines for automated survey generation.
Abstract: As academic paper publication numbers grow exponentially, conducting in-depth surveys with LLMs automatically has become an inevitable trend. Outline writing, which aims to systematically organize related works, is critical for automated survey generation. Yet existing automatic survey methods treat outline writing as mere workflow steps in the overall pipeline. Such template-based workflows produce outlines that lack in-depth understanding of the survey topic and fine-grained styles. To address these limitations, we propose Meow, the first metadata-driven outline writing framework that produces organized and faithful outlines efficiently. Specifically, we first formulate outline writing as an end-to-end task that generates hierarchical structured outlines from paper metadata. We then curate a high-quality dataset of surveys from arXiv, bioRxiv, and medRxiv, and establish systematic evaluation metrics for outline quality assessment. Finally, we employ a two-stage training approach combining supervised fine-tuning and reinforcement learning. Our 8B reasoning model demonstrates strong performance with high structural fidelity and stylistic coherence.
[32] How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models
Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu, Yongwei Wang, Wenbo Su, Bo Zheng
Main category: cs.CL
TL;DR: This paper studies memory collapse in large language models caused by excessive domain knowledge infusion during pretraining, identifies critical collapse points that scale with model size, and proposes a scaling law to predict optimal domain knowledge injection amounts.
Details
Motivation: LLMs underperform on specialized knowledge benchmarks without domain-specific optimization, but excessive domain knowledge infusion causes catastrophic forgetting. The paper aims to address the trade-off between insufficient specialization and memory collapse.Method: The authors conduct systematic experiments to observe critical collapse points and scale correlation, then propose a knowledge infusion scaling law that predicts optimal domain knowledge amounts by analyzing smaller model counterparts.
Result: Experiments reveal: 1) Each model has a threshold beyond which knowledge retention degrades sharply, and 2) These collapse points scale consistently with model size. The proposed scaling law is validated across different model sizes and token budgets.
Conclusion: The knowledge infusion scaling law effectively predicts optimal domain knowledge injection amounts for large LLMs, providing a practical solution to balance specialization and knowledge retention in domain adaptation.
Abstract: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucinations. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations: 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model's size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pretraining token budgets validate both the effectiveness and generalizability of our scaling law.
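The prediction step amounts to fitting how the collapse point grows with model size on small models and extrapolating to a larger target. A minimal sketch with invented measurements and an assumed power-law functional form (the paper's actual law may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: model size (parameters) vs. the domain-token
# fraction at which knowledge retention was observed to collapse.
sizes = np.array([1e8, 3e8, 1e9, 3e9])
collapse_frac = np.array([0.08, 0.12, 0.19, 0.28])

def power_law(n, a, b):
    return a * n ** b  # assumed functional form

(a, b), _ = curve_fit(power_law, sizes, collapse_frac, p0=(1e-3, 0.2))

# Extrapolate the safe infusion budget for a larger target model.
target = 7e9
print(f"predicted collapse point for a 7B model: {power_law(target, a, b):.3f}")
```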
[33] A Pipeline to Assess Merging Methods via Behavior and Internals
Yutaro Sigris, Andreas Waldis
Main category: cs.CL
TL;DR: This paper presents the first comprehensive evaluation of language model merging methods, connecting both behavioral performance and internal linguistic representations to understand how model merging affects capabilities.
Details
Motivation: Existing studies only evaluate merged models from a behavioral perspective, lacking insight into how merging impacts internal representations and linguistic competence. The authors aim to provide a more comprehensive understanding of model merging effects.Method: The authors developed a novel evaluation pipeline that merges multiple parent language models (specifically Qwen2.5 family models) and compares the merged models to original ones using both downstream task performance (MMLU) and internal linguistic competence analysis, particularly focusing on morphology and syntax.
Result: Merging methods affect behavior and internal representations differently. While merged models’ performance typically falls between parent models, their encoded linguistic information can surpass both parents. There’s weak correlation between behavioral performance and internal linguistic evaluation.
Conclusion: Comprehensive evaluations beyond superficial behavioral metrics are needed to faithfully understand model merging capabilities and reliability, as behavioral advances may not reflect true linguistic competence improvements.
Abstract: Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and the internal encoded linguistic competence. We showcase this pipeline by assessing the merging of instruction fine-tuned with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impacts behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena, particularly in morphology and syntax, can surpass the parent models. Moreover, we find weak ranking correlation between this behavior and internal evaluation. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging methods to gain a faithful understanding of their capabilities and reliability, beyond potential superficial behavioral advances.
[34] Do LLMs Encode Frame Semantics? Evidence from Frame Identification
Jayanth Krishna Chundru, Rudrashis Poddar, Jie Cao, Tianyu Jiang
Main category: cs.CL
TL;DR: Large language models can effectively perform frame identification (selecting appropriate semantic frames for target words) without explicit supervision, and fine-tuning on FrameNet data significantly improves performance while maintaining generalization.
Details
Motivation: To investigate whether large language models inherently encode latent knowledge of frame semantics, particularly frame identification - a core challenge in frame semantic parsing.Method: Used FrameNet lexical resource to evaluate models under prompt-based inference, then fine-tuned models on FrameNet data to assess impact of task-specific training.
Result: Models performed frame identification effectively without supervision, and fine-tuning substantially improved in-domain accuracy while generalizing well to out-of-domain benchmarks. Models could also generate semantically coherent frame definitions.
Conclusion: Large language models have internalized understanding of frame semantics, demonstrating latent knowledge that can be effectively leveraged for frame identification tasks.
Abstract: We investigate whether large language models encode latent knowledge of frame semantics, focusing on frame identification, a core challenge in frame semantic parsing that involves selecting the appropriate semantic frame for a target word in context. Using the FrameNet lexical resource, we evaluate models under prompt-based inference and observe that they can perform frame identification effectively even without explicit supervision. To assess the impact of task-specific training, we fine-tune the model on FrameNet data, which substantially improves in-domain accuracy while generalizing well to out-of-domain benchmarks. Further analysis shows that the models can generate semantically coherent frame definitions, highlighting the model’s internalized understanding of frame semantics.
[35] Confidence Calibration in Large Language Model-Based Entity Matching
Iris Kamsteeg, Juan Cardenas-Cartagena, Floris van Beers, Gineke ten Holt, Tsegaye Misikir Tashu, Matias Valdenegro-Toro
Main category: cs.CL
TL;DR: This paper explores confidence calibration methods for RoBERTa models in Entity Matching tasks, finding that Temperature Scaling effectively reduces overconfidence.
Details
Motivation: The research aims to address the issue of overconfidence in Large Language Models (specifically RoBERTa) when applied to Entity Matching tasks, which is crucial for reliable model predictions.Method: Empirical study comparing baseline RoBERTa confidences against three calibration methods: Temperature Scaling, Monte Carlo Dropout, and Ensembles, using four datasets (Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company).
Result: The modified RoBERTa model shows slight overconfidence with Expected Calibration Error scores ranging from 0.0043 to 0.0552. Temperature Scaling reduces these errors by up to 23.83%.
Conclusion: Temperature Scaling is an effective method for mitigating overconfidence in RoBERTa models for Entity Matching tasks, significantly improving calibration performance.
Abstract: This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.
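For concreteness, here is a minimal sketch of the two pieces involved: fitting a single temperature T on held-out logits by minimizing NLL, and measuring Expected Calibration Error before and after. The random binary logits stand in for RoBERTa match/no-match outputs:

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, iters=200, lr=0.01):
    """Learn a single scalar T minimizing NLL on held-out logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

def expected_calibration_error(probs, labels, bins=10):
    """Standard ECE: bin predictions by confidence, compare confidence to accuracy."""
    conf, pred = probs.max(dim=1)
    ece = torch.tensor(0.0)
    for lo in torch.linspace(0, 1, bins + 1)[:-1]:
        mask = (conf > lo) & (conf <= lo + 1.0 / bins)
        if mask.any():
            acc = (pred[mask] == labels[mask]).float().mean()
            ece += mask.float().mean() * (acc - conf[mask].mean()).abs()
    return ece.item()

# Toy usage with overconfident random logits for a binary match/no-match task.
logits = torch.randn(1000, 2) * 3
labels = torch.randint(0, 2, (1000,))
T = fit_temperature(logits, labels)
before = expected_calibration_error(F.softmax(logits, dim=1), labels)
after = expected_calibration_error(F.softmax(logits / T, dim=1), labels)
print(f"T={T:.2f}  ECE before={before:.4f}  after={after:.4f}")
```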
[36] Uncertainty in Semantic Language Modeling with PIXELS
Stefania Radu, Marco Zullich, Matias Valdenegro-Toro
Main category: cs.CL
TL;DR: Pixel-based language models underestimate uncertainty in patch reconstruction and show script-dependent uncertainty patterns, with Latin languages having lower uncertainty. Ensemble learning with hyperparameter tuning improves performance on NER and QA tasks across 16 languages.
Details
Motivation: To address the uncertainty quantification challenge in pixel-based language models, which aim to solve vocabulary bottleneck problems in language modeling.Method: Used Monte Carlo Dropout, Transformer Attention, and Ensemble Learning methods to analyze uncertainty and confidence across 18 languages and 7 scripts on 3 semantically challenging tasks.
Result: Pixel-based models underestimate uncertainty during patch reconstruction, with uncertainty influenced by script type (Latin languages show lower uncertainty). Ensemble learning with hyperparameter tuning demonstrated better performance on named entity recognition and question-answering tasks.
Conclusion: The study provides comprehensive uncertainty analysis for pixel-based language models, revealing script-dependent uncertainty patterns and demonstrating the effectiveness of ensemble methods with proper tuning for multilingual NLP tasks.
Abstract: Pixel-based language models aim to solve the vocabulary bottleneck problem in language modeling, but the challenge of uncertainty quantification remains open. The novelty of this work consists of analysing uncertainty and confidence in pixel-based language models across 18 languages and 7 scripts, all part of 3 semantically challenging tasks. This is achieved through several methods such as Monte Carlo Dropout, Transformer Attention, and Ensemble Learning. The results suggest that pixel-based models underestimate uncertainty when reconstructing patches. The uncertainty is also influenced by the script, with Latin languages displaying lower uncertainty. The findings on ensemble learning show better performance when applying hyperparameter tuning during the named entity recognition and question-answering tasks across 16 languages.
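Of the methods listed, Monte Carlo Dropout is the easiest to show in a few lines: keep dropout active at inference and treat the spread over stochastic forward passes as an uncertainty estimate. A minimal PyTorch sketch with a toy classifier standing in for the pixel-based model:

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model):
    """Keep dropout layers active at inference time."""
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_predict(model, x, passes=20):
    """Run several stochastic forward passes; the spread across passes is an
    (approximate) predictive uncertainty estimate."""
    model.eval()
    enable_mc_dropout(model)  # re-enable only the dropout layers
    preds = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)])
    return preds.mean(0), preds.std(0)  # mean prediction, per-class spread

# Toy usage with a small classifier standing in for the pixel-based model.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 4))
mean, std = mc_dropout_predict(model, torch.randn(8, 16))
print(mean.shape, std.shape)
```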
[37] ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities
Aleksis Datseris, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva
Main category: cs.CL
TL;DR: Exact Positional Embeddings (ExPE) is a novel absolute positional embedding method that enables transformers to extrapolate to sequences longer than training lengths by overriding specific embedding dimensions for precise positional encoding.
Details
Motivation: Traditional transformer position embeddings struggle with extrapolation to longer sequences than seen during training, limiting their generalization capabilities.Method: Uses a novel embedding strategy that encodes exact positional information by overriding specific dimensions of embedding vectors, maintaining original embedding integrity while enabling precise position representation.
Result: In causal language modeling, ExPE significantly reduces perplexity compared to rotary and sinusoidal embeddings when tested on sequences longer than training lengths.
Conclusion: ExPE provides an effective solution for positional embeddings that enhances model generalization to extended sequences beyond training data.
Abstract: This paper introduces a novel approach to position embeddings in transformer models, named Exact Positional Embeddings (ExPE), an absolute positional embedding method that can extrapolate to sequences longer than those it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, which often struggle with extrapolation to sequences longer than those seen during training. Our proposed method utilizes a novel embedding strategy that encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model's ability to generalize to more extended sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings when tested on sequences longer than those used in training.
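The abstract does not spell out the override scheme, so the following sketch is only one plausible reading: reserve a few embedding dimensions and overwrite them with a digit-style, exact encoding of the absolute position, leaving the remaining dimensions untouched. The digit-in-a-fixed-base construction is an illustrative assumption, not the paper's stated method:

```python
import torch

def expe_encode(token_emb, positions, num_reserved=4, scale=10_000.0):
    """Overwrite a small reserved slice of each token embedding with an exact
    encoding of its position, instead of adding a signal to every dimension."""
    emb = token_emb.clone()
    # Encode the position as digits in a fixed base across the reserved dims,
    # so any position has an exact, unambiguous representation.
    base = scale ** (1.0 / num_reserved)
    for i in range(num_reserved):
        emb[..., i] = (positions / base**i) % base / base  # normalized digit
    return emb

seq_len, dim = 8, 32
tok = torch.randn(seq_len, dim)
pos = torch.arange(seq_len, dtype=torch.float32)
out = expe_encode(tok, pos)
print(out[:, :4])  # the reserved dims now carry exact position information
```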
[38] LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines
Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Patricia Culligan, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Tom Stapleford, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh Chawla
Main category: cs.CL
TL;DR: This paper provides a comprehensive overview of state-of-the-art Large Language Models (LLMs) and their integration across diverse academic disciplines including arts, business, and science, while discussing limitations and future directions.
Details
Motivation: The impressive performance of LLMs like ChatGPT on language tasks has demonstrated their potential for far-reaching impacts across real-world applications, inspiring a need to understand how these models are shaping research and practice across various fields.Method: The paper offers a systematic review and overview of LLM applications across three main disciplinary categories: (1) arts, letters, and law; (2) economics and business; and (3) science and engineering, exploring integration patterns and impacts.
Result: The review synthesizes how LLMs are being engaged across disciplines, providing key observations and insights about their transformative potential in diverse real-world applications.
Conclusion: By examining LLM integration across academic fields and discussing limitations and challenges, the paper aims to help researchers and practitioners leverage LLMs to advance their work in the era of generative AI.
Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines, along with key observations and insights, can help researchers and practitioners interested in exploiting LLMs to advance their work in diverse real-world applications.
[39] GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models
Dylan Hutson, Daniel Vennemeyer, Aneesh Deshmukh, Justin Zhan, Tianyu Jiang
Main category: cs.CL
TL;DR: GuessingGame protocol evaluates LLMs as strategic question-askers in open-ended settings using information gain metrics to measure question quality and improve interactive reasoning.
Details
Motivation: To develop a method for evaluating how well large language models can strategically ask questions to identify hidden objects without predefined choices, and to measure and improve question-asking quality.Method: Proposes two information gain metrics: Bayesian belief updates over semantic concepts using LLM-scored relevance, and entropy-based candidate filtering via ConceptNet. Tests across 858 games with various models and prompting strategies.
Result: Higher information gain strongly predicts efficiency - one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG (like enforcing question diversity) significantly improve weaker models’ performance.
Conclusion: Question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning, showing that strategic questioning can be effectively evaluated and enhanced.
Abstract: We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
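The entropy-based metric reduces to a small computation: the information gain of a question is the entropy of the candidate pool before the answer minus the entropy after inconsistent candidates are filtered out. A minimal sketch with a uniform belief over candidates and a toy membership test standing in for the ConceptNet-based filtering:

```python
import math

def entropy(candidates):
    """Shannon entropy of a uniform belief over the remaining candidates."""
    return math.log2(len(candidates)) if candidates else 0.0

def information_gain(candidates, question_filter, answer):
    """IG of a question = entropy before minus entropy after discarding
    candidates inconsistent with the oracle's answer."""
    before = entropy(candidates)
    remaining = [c for c in candidates if question_filter(c) == answer]
    return before - entropy(remaining), remaining

# Toy example: guessing a hidden object; "is it alive?" halves the pool.
pool = ["dog", "cat", "rock", "chair"]
is_alive = lambda obj: obj in {"dog", "cat"}  # stand-in for ConceptNet relations
ig, pool = information_gain(pool, is_alive, answer=True)
print(f"IG = {ig:.2f} bits, remaining = {pool}")  # 1.00 bit, ['dog', 'cat']
```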
[40] Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models
Mohammad Saim, Phan Anh Duong, Cat Luong, Aniket Bhanderi, Tianyu Jiang
Main category: cs.CL
TL;DR: ELENA framework uses large vision-language models to generate embodied emotion narratives focusing on body parts, overcoming facial bias and enabling effective emotion recognition even in face-masked images.
Details
Motivation: To leverage embodied emotional reactions from body parts for affective analysis, addressing the limitation of current models' bias towards facial regions.Method: Proposes ELENA framework using state-of-the-art large vision-language models to generate multi-layered text outputs focusing on salient body parts involved in emotional reactions, with attention map analysis.
Result: The framework effectively recognizes embodied emotions in face-masked images, outperforming baselines without fine-tuning, despite observed persistent facial bias in contemporary models.
Conclusion: ELENA opens new possibilities for embodied emotion analysis across vision modality and enriches affect-aware modeling by moving beyond facial-centric approaches.
Abstract: The embodiment of emotional reactions from body parts contains rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision-language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, primarily comprising descriptions that focus on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias towards the facial region. Despite this limitation, we observe that our employed framework can effectively recognize embodied emotions in face-masked images, outperforming baselines without any fine-tuning. ELENA opens a new trajectory for embodied emotion analysis across the modality of vision and enriches modeling in an affect-aware setting.
[41] Evaluating Language Translation Models by Playing Telephone
Syeda Jannatus Saba, Steven Skiena
Main category: cs.CL
TL;DR: Proposes an unsupervised method to generate training data for translation evaluation by repeated translation between languages, improving performance over xCOMET on quality scoring and translation selection tasks.
Details
Motivation: Current translation evaluation methods lag behind language model capabilities, limiting improvement on challenging tasks like long-form and literary translation.Method: Uses repeated rounds of translation between source and target languages to generate training data for evaluation systems, employing both model rotation and language translation approaches.
Result: Demonstrates improved performance over xCOMET on scoring translation quality against human references and selecting which translation is closer to the original source.
Conclusion: The unsupervised data generation method enables better translation evaluation systems that can handle diverse document lengths and application domains.
Abstract: Our ability to efficiently and accurately evaluate the quality of machine translation systems has been outrun by the effectiveness of current language models, which limits the potential for further improving these models on more challenging tasks like long-form and literary translation. We propose an unsupervised method to generate training data for translation evaluation over different document lengths and application domains by repeated rounds of translation between source and target languages. We assess evaluation systems trained on texts mechanically generated using both model rotation and language translation approaches, demonstrating improved performance over a popular translation evaluation system (xCOMET) on two different tasks: (i) scoring the quality of a given translation against a human reference and (ii) selecting which of two translations is generationally closer to an original source document.
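A minimal sketch of the data-generation loop, with a tagging placeholder where a real MT system (or a rotation of systems) would go: each round trip through the target language drifts further from the source, so pairing the source with round-k outputs yields naturally ranked training data for an evaluation model.

```python
def translate(text, src, tgt):
    """Stand-in for a real MT system; replace with an actual model or API call."""
    return f"[{src}->{tgt}] {text}"  # placeholder: a real system returns a translation

def telephone_rounds(source, src_lang, tgt_lang, rounds=3):
    """Repeatedly translate back and forth; versions[k] is assumed to be
    progressively noisier as k grows, giving graded quality labels for free."""
    versions = [source]
    text = source
    for _ in range(rounds):
        text = translate(text, src_lang, tgt_lang)   # forward pass
        text = translate(text, tgt_lang, src_lang)   # back-translation
        versions.append(text)
    return versions

for k, v in enumerate(telephone_rounds("The sea was calm that night.", "en", "fr")):
    print(k, v)
```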
[42] AutoSpec: An Agentic Framework for Automatically Drafting Patent Specification
Ryan Shea, Zhou Yu
Main category: cs.CL
TL;DR: AutoSpec is a secure, agentic framework that automates patent specification drafting by decomposing the process into manageable subtasks using smaller open-source language models with custom tools.
Details
Motivation: Patent drafting is expensive and time-consuming, but faces challenges with confidentiality (preventing closed-source LLM use) and complexity (long context, technical writing, specialized knowledge) that hinder automation.Method: Decomposes patent drafting into sequence of manageable subtasks, each solved by smaller open-source language models enhanced with custom patent drafting tools.
Result: AutoSpec outperforms existing baselines on patent drafting tasks according to both automatic and expert evaluations conducted with experienced patent attorneys.
Conclusion: The framework successfully addresses confidentiality and complexity challenges in automated patent drafting through a secure, agentic approach using specialized open-source models.
Abstract: Patents play a critical role in driving technological innovation by granting inventors exclusive rights to their inventions. However the process of drafting a patent application is often expensive and time-consuming, making it a prime candidate for automation. Despite recent advancements in language models, several challenges hinder the development of robust automated patent drafting systems. First, the information within a patent application is highly confidential, which often prevents the use of closed-source LLMs for automating this task. Second, the process of drafting a patent application is difficult for even the most advanced language models due to their long context, technical writing style, and specialized domain knowledge. To address these challenges, we introduce AutoSpec, a secure, agentic framework for Automatically drafting patent Specification. Our approach decomposes the drafting process into a sequence of manageable subtasks, each solvable by smaller, open-source language models enhanced with custom tools tailored for drafting patent specification. To assess our system, we design a novel evaluation protocol in collaboration with experienced patent attorneys. Our automatic and expert evaluations show that AutoSpec outperforms existing baselines on a patent drafting task.
[43] Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections
Yicheng Yang, Zixian Li, Jean Paul Bizimana, Niaz Zafri, Yongfeng Dong, Tianyi Li
Main category: cs.CL
TL;DR: This paper proposes using multimodal LLMs with specialized prompt design to model driver-pedestrian interactions at crosswalks, achieving superior performance over traditional ML methods.
Details
Motivation: Traditional machine learning models struggle with the nuanced, context-dependent reasoning required for driver-pedestrian interactions due to fixed feature representations and limited interpretability.Method: Leverages multimodal LLMs through novel prompt design incorporating domain-specific knowledge, structured reasoning, and few-shot prompting for interpretable inference of driver yielding behavior.
Result: GPT-4o achieved highest accuracy and recall, while Deepseek-V3 excelled in precision, demonstrating LLMs’ superiority over traditional classifiers in modeling pedestrian-driver interactions.
Conclusion: The findings highlight trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.
Abstract: Pedestrian safety is a critical component of urban mobility and is strongly influenced by the interactions between pedestrian decision-making and driver yielding behavior at crosswalks. Modeling driver–pedestrian interactions at intersections requires accurately capturing the complexity of these behaviors. Traditional machine learning models often struggle to capture the nuanced and context-dependent reasoning required for these multifactorial interactions, due to their reliance on fixed feature representations and limited interpretability. In contrast, large language models (LLMs) are suited for extracting patterns from heterogeneous traffic data, enabling accurate modeling of driver-pedestrian interactions. Therefore, this paper leverages multimodal LLMs through a novel prompt design that incorporates domain-specific knowledge, structured reasoning, and few-shot prompting, enabling interpretable and context-aware inference of driver yielding behavior, as an example application of modeling pedestrian–driver interaction. We benchmarked state-of-the-art LLMs against traditional classifiers, finding that GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. These findings highlight the critical trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.
[44] DyBBT: Dynamic Balance via Bandit inspired Targeting for Dialog Policy with Cognitive Dual-Systems
Shuyu Zhang, Yifan Wei, Jialuo Yuan, Xinru Wang, Yanmin Zhu, Bin Li
Main category: cs.CL
TL;DR: DyBBT is a dialog policy learning framework that uses dynamic exploration strategies based on cognitive states to improve task-oriented dialog systems.
Details
Motivation: Static exploration strategies in dialog systems lead to inefficient exploration and suboptimal performance. Current methods don't adapt to dynamic dialog contexts.Method: Proposes a bandit-inspired meta-controller that dynamically switches between fast intuitive inference (System 1) and slow deliberative reasoning (System 2) based on real-time cognitive states capturing dialog progression, user uncertainty, and slot dependency.
Result: Achieves state-of-the-art performance in success rate, efficiency, and generalization on single- and multi-domain benchmarks. Human evaluations confirm decisions align with expert judgment.
Conclusion: DyBBT effectively addresses the exploration challenge in dialog systems through adaptive cognitive state-based switching between different reasoning modes, demonstrating superior performance across multiple domains.
Abstract: Task-oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT employs a bandit-inspired meta-controller that dynamically switches between a fast intuitive inference (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, with human evaluations confirming its decisions are well aligned with expert judgment. Code is available at https://github.com/carsonz/DyBBT.
[45] Personality Vector: Modulating Personality of Large Language Models by Model Merging
Seungjong Sun, Seo Yeon Baek, Jang Hyun Kim
Main category: cs.CL
TL;DR: A novel method for personality modulation in LLMs via model merging using personality vectors derived from weight differences between pre-trained and fine-tuned models.
Details
Motivation: To address the limitations of previous approaches in capturing the continuous and multidimensional nature of human personality traits in LLMs, enabling personalized AI systems.Method: Construct personality vectors by subtracting weights of pre-trained models from fine-tuned models on specific personality traits, then merge these vectors to modulate LLM behavior without additional training.
Result: Personality vectors enable continuous control over trait intensity, support composition of multiple traits, and transfer across diverse downstream models, indicating generalizable personality representations.
Conclusion: The proposed personality vector method effectively captures and modulates personality traits in LLMs, offering a scalable approach for personalized AI systems without requiring retraining.
Abstract: Driven by the demand for personalized AI systems, there is growing interest in aligning the behavior of large language models (LLMs) with human traits such as personality. Previous attempts to induce personality in LLMs have shown promising results, but they struggle to capture the continuous and multidimensional nature of human traits. In this work, we propose a novel method for personality modulation in LLMs via model merging. Specifically, we construct personality vectors by subtracting the weights of a pre-trained model from those of the model fine-tuned on a given personality trait. By merging personality vectors, we enable LLMs to exhibit desired personality traits without additional training. Extensive experiments show that personality vectors enable continuous control over trait intensity and support the composition of multiple traits. Furthermore, personality vectors transfer across diverse downstream models, suggesting that they encode generalizable representations of personality. Our code is available here.
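The construction is plain weight arithmetic, in the spirit of task vectors: subtract the pre-trained state dict from the fine-tuned one, then add a weighted sum of such vectors back to a base model, with the scalar weight modulating trait intensity. A minimal sketch on toy tensors (real use would operate on full LLM checkpoints):

```python
import torch

def personality_vector(finetuned_sd, pretrained_sd):
    """Personality vector = fine-tuned weights minus pre-trained weights."""
    return {k: finetuned_sd[k] - pretrained_sd[k] for k in pretrained_sd}

def apply_personality(base_sd, vectors, weights):
    """Merge: add a weighted sum of personality vectors to the base weights.
    The scalar weight controls trait intensity; multiple vectors compose traits."""
    merged = {k: v.clone() for k, v in base_sd.items()}
    for vec, w in zip(vectors, weights):
        for k in merged:
            merged[k] += w * vec[k]
    return merged

# Toy usage with tiny random "models" in place of real LLM checkpoints.
pre = {"w": torch.randn(4, 4)}
ft_extraversion = {"w": pre["w"] + 0.1 * torch.randn(4, 4)}  # hypothetical trait FT
v = personality_vector(ft_extraversion, pre)
merged = apply_personality(pre, [v], weights=[0.5])  # half-strength trait
print(torch.norm(merged["w"] - pre["w"]))
```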
[46] HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST
Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li
Main category: cs.CL
TL;DR: HiCoLoRA is a hierarchical LoRA framework that enhances zero-shot dialog state tracking through dynamic layer-specific processing, spectral clustering for transferable associations, and semantic-enhanced SVD initialization to overcome semantic misalignment between dialog contexts and prompts.
Details
Motivation: To address semantic misalignment between dynamic dialog contexts and static prompts in zero-shot dialog state tracking, which causes inflexible cross-layer coordination, domain interference, and catastrophic forgetting when generalizing to new domains.Method: Proposes Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA) with: 1) hierarchical LoRA architecture for dynamic layer-specific processing, 2) Spectral Joint Domain-Slot Clustering to identify transferable associations, and 3) Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge.
Result: Outperforms baselines on multi-domain datasets MultiWOZ and SGD, achieving state-of-the-art performance in zero-shot dialog state tracking.
Conclusion: HiCoLoRA effectively addresses semantic misalignment challenges in zero-shot DST through its hierarchical collaborative framework, demonstrating superior generalization to new domains without costly data annotation.
Abstract: Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at https://github.com/carsonz/HiCoLoRA.
[47] CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition
Sina J. Semnani, Han Zhang, Xinyan He, Merve Tekgürler, Monica S. Lam
Main category: cs.CL
TL;DR: CHURRO is a 3B-parameter vision-language model specialized for historical document recognition, trained on the largest historical text dataset (CHURRO-DS) and outperforms existing models while being more cost-effective.
Details
Motivation: Existing vision-language models are designed for modern standardized texts and cannot handle the diverse languages, irregular layouts, and degradation found in historical documents, which hinders cultural heritage preservation.
Method: Developed CHURRO-DS dataset (155 corpora, 99,491 pages spanning 22 centuries and 46 language clusters) and trained a 3B-parameter VLM specialized for historical text recognition.
Result: CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, outperforming Gemini 2.5 Pro by 1.4% and 6.5% respectively while being 15.5x more cost-effective.
Conclusion: CHURRO enables improved readability of historical texts and accelerates scholarship through community-driven research, with both model and dataset released publicly.
Abstract: Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
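The headline numbers use normalized Levenshtein similarity. A common definition, sketched below, divides the edit distance by the longer string's length and subtracts from 1; whether CHURRO-DS normalizes exactly this way is an assumption here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_similarity(hyp: str, ref: str) -> float:
    """1 - edit_distance / max_length, so 1.0 is a perfect transcription."""
    if not hyp and not ref:
        return 1.0
    return 1.0 - levenshtein(hyp, ref) / max(len(hyp), len(ref))

print(normalized_similarity("historic", "historical"))  # 0.8
```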
[48] EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation
Sen Yang, Yu Bao, Yu Lu, Jiajun Chen, Shujian Huang, Shanbo Cheng
Main category: cs.CL
TL;DR: A framework that uses LLMs’ strong English translation capabilities to bootstrap non-English (x2x) translation through synthetic data generation and preference optimization.
Details
Motivation: LLMs perform well on English-centric translation but underperform on direct non-English (x2x) translation, creating a need to leverage existing English capabilities to improve multilingual translation.
Method: Extend English parallel corpora into omnidirectional datasets, develop English-referenced quality evaluation proxy, and apply preference-based optimization to collect high-quality x2x training data.
Result: Achieves significant improvement across 72 x2x translation directions for widely used LLMs, while also enhancing English-to-x (en2x) performance.
Conclusion: Strategic exploitation of English-centric strengths can effectively bootstrap comprehensive multilingual translation capabilities in LLMs.
Abstract: Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models’ established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release codes, datasets, and model checkpoints at https://github.com/NJUNLP/EAX
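The "extend English parallel corpora into omnidirectional datasets" step amounts to pivoting multi-way parallel rows into every x2x direction. A minimal sketch, assuming rows keyed by language code (the quality-scoring and preference-optimization stages are omitted):

```python
from itertools import permutations

def omnidirectional_pairs(rows, langs):
    """Expand English-pivoted multi-way parallel rows into all
    (src_lang, tgt_lang, src_text, tgt_text) translation directions."""
    for row in rows:
        for src, tgt in permutations(langs, 2):
            if src in row and tgt in row:
                yield src, tgt, row[src], row[tgt]

rows = [{"en": "Hello.", "de": "Hallo.", "fr": "Bonjour."}]
for pair in omnidirectional_pairs(rows, ["en", "de", "fr"]):
    print(pair)  # 6 directions from one 3-way row
```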
[49] bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs
Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He
Main category: cs.CL
TL;DR: bi-GRPO is a novel RL framework for jailbreak backdoor attacks that achieves >99% attack success rate while maintaining stealthiness and response quality.
Details
Motivation: Existing jailbreak trigger methods (SFT, model editing, RLHF) have limitations in generalization, stealthiness, and contextual usability of generated responses.
Method: Bidirectional Group Relative Policy Optimization (bi-GRPO) uses pairwise rollouts and rewards with rule-based mechanisms and length/format incentives to optimize models for harmful content with triggers while maintaining safety otherwise.
Result: Extensive experiments show bi-GRPO achieves >99% attack success rate, preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses.
Conclusion: bi-GRPO significantly advances the state-of-the-art in jailbreak backdoor attacks by overcoming limitations of existing approaches.
Abstract: With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers–such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)–each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.
[50] Polarity Detection of Sustainable Development Goals in News Text
Andrea Cadeddua, Alessandro Chessa, Vincenzo De Leo, Gianni Fenu, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino, Luca Secchi
Main category: cs.CL
TL;DR: This paper introduces SDG polarity detection - a novel task to classify text as positive, neutral, or negative impact on UN Sustainable Development Goals (SDGs). The authors create SDG-POD benchmark dataset and evaluate six LLMs, finding the task challenging but showing improved performance with fine-tuning and synthetic data augmentation.
Details
Motivation: While NLP can classify text relevance to SDGs, determining the directionality (positive/neutral/negative impact) is equally important for sustainability monitoring but remains unaddressed.
Method: Proposed SDG polarity detection task, created SDG-POD benchmark dataset (original + synthetic data), evaluated six state-of-the-art LLMs in zero-shot and fine-tuned configurations, with data augmentation techniques.
Result: Task remains challenging for current LLMs, but fine-tuned models (especially QWQ-32B) achieve good performance on specific SDGs (9, 12, 15). Synthetic data augmentation improves model performance significantly.
Conclusion: This work advances sustainability monitoring methodology and provides insights for developing efficient polarity detection systems, demonstrating the effectiveness of data enrichment in resource-constrained domains.
Abstract: The United Nations’ Sustainable Development Goals (SDGs) provide a globally recognised framework for addressing critical societal, environmental, and economic challenges. Recent developments in natural language processing (NLP) and large language models (LLMs) have facilitated the automatic classification of textual data according to their relevance to specific SDGs. Nevertheless, in many applications, it is equally important to determine the directionality of this relevance; that is, to assess whether the described impact is positive, neutral, or negative. To tackle this challenge, we propose the novel task of SDG polarity detection, which assesses whether a text segment indicates progress toward a specific SDG or conveys an intention to achieve such progress. To support research in this area, we introduce SDG-POD, a benchmark dataset designed specifically for this task, combining original and synthetically generated data. We perform a comprehensive evaluation using six state-of-the-art LLMs, considering both zero-shot and fine-tuned configurations. Our results suggest that the task remains challenging for the current generation of LLMs. Nevertheless, some fine-tuned models, particularly QWQ-32B, achieve good performance, especially on specific Sustainable Development Goals such as SDG-9 (Industry, Innovation and Infrastructure), SDG-12 (Responsible Consumption and Production), and SDG-15 (Life on Land). Furthermore, we demonstrate that augmenting the fine-tuning dataset with synthetically generated examples yields improved model performance on this task. This result highlights the effectiveness of data enrichment techniques in addressing the challenges of this resource-constrained domain. This work advances the methodological toolkit for sustainability monitoring and provides actionable insights into the development of efficient, high-performing polarity detection systems.
[51] TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios
Ji Yin, Menglan He, Yujie Zhang, Linshuai Zhang, Tingting Ma, Ce Tian, Jie Wu, Lin Xu, Tao Jiang
Main category: cs.CL
TL;DR: TianHui is a specialized Traditional Chinese Medicine (TCM) LLM developed to overcome limitations of domain-specific LLMs through contextual data integration and domain knowledge fusion, achieving top performance across 12 benchmarks.
Details
Motivation: Domain-specific LLMs in TCM face constraints in adaptability, evaluation datasets, and computational resources, limiting their research utility.
Method: Constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA pairs) and employed two-stage training with QLoRA, DeepSpeed Stage 2, and Flash Attention 2. Optimal configuration: LoRA rank=128, alpha=256, epoch=4, dropout=0.2, max length=2048.
Result: TianHui ranked top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG).
Conclusion: TianHui enables systematic preservation and scalable application of TCM knowledge, with all resources open-sourced for community use.
Abstract: Domain-specific LLMs in TCM face limitations in research settings due to constrained adaptability, insufficient evaluation datasets, and limited computational resources. This study presents TianHui, a specialized TCM LLM built through contextual data integration and domain knowledge fusion. We constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage 2, and Flash Attention 2. Evaluation on 12 benchmarks showed TianHui ranked top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG). Optimal configuration was identified as LoRA rank=128, alpha=256, epoch=4, dropout=0.2, max length=2048. TianHui enables systematic preservation and scalable application of TCM knowledge. All resources are open-sourced.
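The reported best configuration maps directly onto standard Hugging Face QLoRA settings. A minimal sketch using peft and bitsandbytes; the quantization details and any unreported arguments (e.g., target modules) are assumptions.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for QLoRA-style training (assumed defaults).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The paper's reported optimum: rank 128, alpha 256, dropout 0.2.
lora = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.2,
    task_type="CAUSAL_LM",
)
# Training sequences would then be truncated or packed to the reported
# maximum length of 2048 tokens.
```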
[52] Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking
Sujoy Sarkar, Gourav Sarkar, Manoj Balaji Jagadeeshan, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal
Main category: cs.CL
TL;DR: Mahānāma is the first large-scale dataset for Entity Discovery and Linking (EDL) in Sanskrit, derived from the Mahābhārata epic, containing 109K entity mentions mapped to 5.5K unique entities with cross-lingual English KB alignment.
Details
Motivation: Entity resolution in literary texts is challenging due to high lexical variation, ambiguous references, and long-range dependencies. Sanskrit, as a morphologically rich and under-resourced language, lacks dedicated EDL datasets.
Method: Created Mahānāma dataset from the Mahābhārata epic, comprising over 109K named entity mentions mapped to 5.5K unique entities, aligned with an English knowledge base for cross-lingual linking.
Result: Current coreference and entity linking models struggle significantly when evaluated on the global context of the test set, revealing limitations in resolving entities within complex literary discourse.
Conclusion: Mahānāma provides a unique benchmark for advancing entity resolution, especially in literary domains, highlighting the need for improved approaches to handle complex narrative structures and extensive name variation.
Abstract: High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present Mahānāma, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the Mahābhārata, the world’s longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of Mahānāma, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. Mahānāma thus provides a unique benchmark for advancing entity resolution, especially in literary domains.
[53] Benchmarking Gaslighting Attacks Against Speech Large Language Models
Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou
Main category: cs.CL
TL;DR: The paper introduces gaslighting attacks on Speech LLMs - strategically crafted prompts designed to mislead and distort model reasoning, revealing significant vulnerabilities with an average 24.3% accuracy drop across tested models.
Details
Motivation: Speech LLMs are increasingly used in voice applications but their robustness against manipulative input is underexplored. Speech presents unique challenges like ambiguity and perceptual diversity that make adversarial attacks harder to detect compared to text-based systems.
Method: Developed five gaslighting attack strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation) to test model robustness. Framework evaluates both performance degradation and behavioral responses. Conducted acoustic perturbation experiments for multi-modal assessment across 5 Speech/multi-modal LLMs on 10,000+ test samples from 5 datasets.
Result: Comprehensive evaluation revealed an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability in Speech LLMs.
Conclusion: The findings demonstrate critical vulnerabilities in current Speech LLMs and highlight the urgent need for more resilient and trustworthy speech-based AI systems.
Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.
[54] SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection
Alba Maria Marmol-Romero, Manuel Garcia-Vega, Miguel Angel Garcia-Cumbreras, Arturo Montejo-Raez
Main category: cs.CL
TL;DR: SINAI-UJA team participated in eRisk@CLEF 2025, achieving 8th place in Task 2 (Contextualized Early Depression Detection) and 1st place in Pilot Task (Conversational Depression Detection via LLMs) using transformer models and structured conversational strategies.
Details
Motivation: To develop effective methods for early detection of depression in multi-user conversations and explore the use of LLMs for conversational depression assessment in mental health contexts.
Method: For Task 2: Used preprocessing pipeline with transformer models (RoBERTa Base, MentalRoBERTa Large) to capture contextual and sequential patterns. For Pilot Task: Designed conversational strategies to maximize information gain from LLM-powered personas within limited dialogue turns.
Result: Task 2: Ranked 8th/12 teams based on F1 score, but achieved fastest early predictions. Pilot Task: Ranked 1st/5 teams with best performance across all metrics (DCHR, ADODL, ASHR).
Conclusion: Demonstrated trade-off between early detection and accuracy, showing potential for joint optimization. Success in Pilot Task validates structured conversational design with LLMs for sensitive mental health assessment.
Abstract: This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTa Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.
[55] SwissGPC v1.0 – The Swiss German Podcasts Corpus
Samuel Stucki, Mark Cieliebak, Jan Deriu
Main category: cs.CL
TL;DR: SwissGPC v1.0 is a large-scale corpus of spontaneous Swiss German speech containing ~5000 hours of annotated audio from talk shows and podcasts, covering seven major dialect regions and Standard German.
Details
Motivation: To create a resource for Swiss German speech research that captures natural, spontaneous conversations rather than controlled speech, supporting ASR, TTS, dialect identification, and related applications.
Method: Corpus construction using links to Schweizer Radio und Fernsehen and YouTube content, followed by segmentation and weak annotation through an automated pipeline. The dataset includes ~5400 hours of raw audio reduced to ~5000 hours after processing.
Result: A comprehensive corpus with statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing corpora, it features natural conversations rather than controlled speech.
Conclusion: SwissGPC v1.0 represents a valuable resource for real-world speech applications due to its spontaneous nature and large-scale coverage of Swiss German dialects alongside Standard German.
Abstract: We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, developed to support research in ASR, TTS, dialect identification, and related fields. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which contain approximately 5400 hours of raw audio. After segmentation and weak annotation, nearly 5000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German. We describe the corpus construction methodology, including an automated annotation pipeline, and provide statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing Swiss German speech corpora, which primarily feature controlled speech, this corpus captures natural, spontaneous conversations, making it a valuable resource for real-world speech applications.
[56] Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation
Wei-Hsiang Lin, Sheng-Lun Wei, Hen-Hsen Huang, Hsin-Hsi Chen
Main category: cs.CL
TL;DR: This paper investigates the relationship between LLMs’ generation and judgment abilities, finding they are only weakly correlated despite relying on the same knowledge. The authors propose a self-reference-guided evaluation strategy that significantly strengthens this correlation.
Details
Motivation: LLM-as-Judge frameworks are popular for AI evaluation, but research findings on the relationship between models' generation and judgment abilities remain inconsistent. The authors aim to systematically investigate this relationship to provide more reliable evaluation methods.
Method: The study conducts systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. They propose a self-reference-guided evaluation strategy that leverages a model’s own answers as references during judgment.
Result: Analyses reveal that generation and judgment abilities are only weakly correlated, primarily due to LLMs’ sensitivity to the responses being judged. The proposed self-reference approach significantly strengthens this correlation.
Conclusion: The self-reference-guided evaluation offers a practical path to align generation and judgment skills, providing a reliable proxy for model selection in evaluation tasks.
Abstract: LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models’ generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs' sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model’s own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and providing a reliable proxy for model selection in evaluation tasks.
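Operationally, self-reference-guided evaluation means the judge model first answers the question itself and then sees that answer as a reference while scoring. A minimal sketch; the prompt wording and the 1-10 scale are assumptions, not the paper's exact template.

```python
def self_reference_judge_prompt(question: str, candidate: str, own_answer: str) -> str:
    """Build a judge prompt that includes the judge model's own prior
    answer as a reference point during evaluation."""
    return (
        f"Question:\n{question}\n\n"
        f"Reference (your own earlier answer):\n{own_answer}\n\n"
        f"Candidate response:\n{candidate}\n\n"
        "Compare the candidate against the reference and rate it "
        "from 1 to 10, with a one-sentence justification."
    )

# own_answer would come from a first generation pass by the same model.
print(self_reference_judge_prompt("What is 2+2?", "It is 4.", "4"))
```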
[57] Future Policy Aware Preference Learning for Mathematical Reasoning
Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo
Main category: cs.CL
TL;DR: FPA preference learning addresses mathematical reasoning challenges in LLMs by using future policy regularization instead of current policy regularization to prevent over-penalization of shared useful tokens during preference learning.
Details
Motivation: Standard preference learning methods like DPO fail in mathematical reasoning due to large token overlap between preferred/dispreferred trajectories, causing over-penalization of useful tokens and performance collapse.
Method: FPA replaces current policy with future policy in regularization term, estimated via lightweight logit-space extrapolation from reference model to current model, enabling proactive regularization.
Result: FPA applied to DPO, RPO, and SimPER on MATH and GSM8K benchmarks yields consistent gains (up to 5.75% with SimPER), enabling longer degradation-free training with negligible overhead.
Conclusion: FPA provides effective proactive regularization that preserves useful mathematical tokens while preventing performance collapse, making preference learning viable for mathematical reasoning tasks.
Abstract: Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
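The future policy is cheap to form: extrapolate in logit space from the reference model through the current model. A sketch of that one step, where the extrapolation coefficient `tau` and the exact functional form are assumptions based on the description above:

```python
import torch

def future_policy_log_probs(ref_logits, cur_logits, tau=1.0):
    """Extrapolate past the current policy in logit space:
    future = current + tau * (current - reference), then normalize."""
    future_logits = cur_logits + tau * (cur_logits - ref_logits)
    return torch.log_softmax(future_logits, dim=-1)

ref = torch.randn(4, 32000)  # reference-model logits for 4 positions
cur = torch.randn(4, 32000)  # current-policy logits
log_pi_future = future_policy_log_probs(ref, cur)
# log_pi_future would replace the current-policy term in the
# DPO/RPO/SimPER regularizer.
```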
[58] WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
Binbin Zhang, Chengdong Liang, Shuai Wang, Xuelong Geng, Zhao Guo, Haoyu Li, Hao Yin, Xipeng Yang, Pengshen Zhang, Changwei Ma, Lei Xie
Main category: cs.CL
TL;DR: WEST is a speech toolkit based on large language models that supports speech understanding, generation, and interaction with full LLM-based architecture, full-stack capabilities, and simple usability.
Details
Motivation: To create a comprehensive speech toolkit that leverages mature LLM architectures and ecosystems while being accessible to everyone, providing both reproducible baseline systems and high-performance ready-to-use models.
Method: Uses large language model architecture with sequence packing, supports recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. Provides two recipe types: one fully reproducible with open-source data/models, and one trained on massive data for superior performance.
Result: WEST offers a complete speech toolkit with both baseline systems for verification and high-performance models ready for deployment, publicly available on GitHub.
Conclusion: WEST successfully creates an accessible, full-featured speech toolkit that bridges the gap between research reproducibility and practical application, making advanced speech technology available to a wider audience.
Abstract: In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/
[59] CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems
Soham Bhattacharjee, Mukund K Roy, Yathish Poojary, Bhargav Dave, Mihir Raj, Vandan Mujadia, Baban Gain, Pruthwik Mishra, Arafat Ahsan, Parameswari Krishnamurthy, Ashwath Rao, Gurpreet Singh Josan, Preeti Dubey, Aadil Amin Kak, Anna Rao Kulkarni, Narendra VG, Sunita Arora, Rakesh Balbantray, Prasenjit Majumdar, Karunesh K Arora, Asif Ekbal, Dipti Mishra Sharma
Main category: cs.CL
TL;DR: This paper introduces CorIL, a large-scale annotated parallel corpus covering 11 Indian languages with 772,000 sentence pairs across Government, Health, and General domains, and benchmarks state-of-the-art NMT models to establish performance trends.
Details
Motivation: High-quality parallel corpora for Indian languages remain scarce despite linguistic diversity, hindering progress in multilingual neural machine translation research and applications.
Method: Created a carefully curated parallel corpus (CorIL) with 772,000 bi-text sentence pairs across 11 Indian languages, categorized into three domains. Fine-tuned and evaluated several SOTA NMT models including IndicTrans2, NLLB, and BhashaVerse to establish benchmarks.
Result: Analysis revealed distinct performance patterns based on language script - massively multilingual models showed advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excelled on Indic scripts. Provided detailed domain-wise performance analysis showing domain sensitivity and cross-script transfer learning capabilities.
Conclusion: CorIL significantly improves availability of high-quality training data for Indian languages and serves as a valuable resource for machine translation research, enabling better domain adaptation and cross-lingual transfer learning.
Abstract: India’s linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus’s value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.
[60] The Knowledge-Behaviour Disconnect in LLM-based Chatbots
Jan Broersen
Main category: cs.CL
TL;DR: LLMs exhibit a fundamental ‘disconnect’ between their knowledge and conversational behavior that cannot be resolved through more data or training, explaining hallucinations and ethical misalignments.
Details
Motivation: To analyze whether large language models genuinely use their knowledge as a basis for conversational behavior, and to identify fundamental limitations in their training methodology.
Method: Philosophical argumentation analyzing the core training techniques of LLMs and examining why they inherently create a disconnect between knowledge representation and behavioral application.
Result: Identifies a fundamental disconnect that persists regardless of training scale, explains hallucinations as a consequence of this disconnect, and shows that ethical alignment techniques fail to address the core issue.
Conclusion: The disconnect represents an inherent limitation in LLM architecture that cannot be overcome through current training methods, requiring fundamentally different approaches to achieve genuine knowledge-based behavior.
Abstract: Large language model-based artificial conversational agents (like ChatGPT) give answers to all kinds of questions, and often enough these answers are correct. Just on the basis of that capacity alone, we may attribute knowledge to them. But do these models use this knowledge as a basis for their own conversational behaviour? I argue this is not the case, and I will refer to this failure as a ‘disconnect’. I further argue this disconnect is fundamental in the sense that with more data and more training of the LLM on which a conversational chatbot is based, it will not disappear. The reason is, as I will claim, that the core technique used to train LLMs does not allow for the establishment of the connection we are after. The disconnect reflects a fundamental limitation on the capacities of LLMs, and explains the source of hallucinations. I will furthermore consider the ethical version of the disconnect (ethical conversational knowledge not being aligned with ethical conversational behaviour), since in this domain researchers have come up with several additional techniques to influence a chatbot’s behaviour. I will discuss how these techniques do nothing to solve the disconnect and can make it worse.
[61] DiffNator: Generating Structured Explanations of Time-Series Differences
Kota Dohi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Main category: cs.CL
TL;DR: DiffNator is a framework that generates structured explanations for differences between two time series using a JSON schema and a model combining time-series encoding with frozen LLM.
Details
Motivation: In IoT applications, interpreting differences between sensor signals requires expert knowledge, creating a need for automated structured explanations.
Method: Design a JSON schema to capture difference properties, use TORI dataset to generate paired sequences, and train a model combining time-series encoder with frozen LLM to output JSON-formatted explanations.
Result: DiffNator generates accurate difference explanations and significantly outperforms both visual question answering baseline and retrieval method using pre-trained time-series encoder.
Conclusion: The proposed framework successfully provides structured explanations for time-series differences, demonstrating superior performance over alternative methods.
Abstract: In many IoT applications, the central interest lies not in individual sensor signals but in their differences, yet interpreting such differences requires expert knowledge. We propose DiffNator, a framework for structured explanations of differences between two time series. We first design a JSON schema that captures the essential properties of such differences. Using the Time-series Observations of Real-world IoT (TORI) dataset, we generate paired sequences and train a model that combines a time-series encoder with a frozen LLM to output JSON-formatted explanations. Experimental results show that DiffNator generates accurate difference explanations and substantially outperforms both a visual question answering (VQA) baseline and a retrieval method using a pre-trained time-series encoder.
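The actual schema is not reproduced in the summary, but a schema of this kind might look like the following hypothetical fragment; the field names and the difference-type vocabulary are invented for illustration only.

```python
# Hypothetical JSON Schema for one detected difference between two series.
difference_schema = {
    "type": "object",
    "properties": {
        "difference_type": {
            "type": "string",
            "enum": ["offset", "amplitude", "phase_shift", "trend", "spike"],
        },
        "start_index": {"type": "integer"},
        "end_index": {"type": "integer"},
        "magnitude": {"type": "number"},
        "description": {"type": "string"},
    },
    "required": ["difference_type", "description"],
}
```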
[62] Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi
Main category: cs.CL
TL;DR: This paper investigates how Tokenization Parity (TP) and Information Parity (IP) predict model performance on dialectal data across different tasks, revealing TP is better for syntax-dependent tasks while IP predicts semantic task performance.
Details
Motivation: Dialectal data shows small linguistic variations that significantly impact model performance, but existing explanations (data size, economic/social factors) are inconsistent. The study aims to directly correlate representational biases (TP and IP) with downstream performance.
Method: Compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks (dialect classification, topic classification, extractive QA), controlling for scripts (Latin vs. non-Latin) and resource availability (high vs. low). Analyze tokenizer behavior, vocabulary coverage, and provide qualitative insights.
Result: TP is a better predictor for syntax/morphology-dependent tasks (e.g., extractive QA), while IP better predicts performance in semantic tasks (e.g., topic classification). Language support claims of LLMs often mask deeper mismatches at script or token level.
Conclusion: Tokenization Parity and Information Parity are effective measures for understanding model performance on dialectal data, with task-specific predictive value. Current language support claims in LLMs may be misleading due to underlying representational biases.
Abstract: Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on the performance of models. This dialect gap has been related to various factors (e.g., data size, economic and social factors) whose impact, however, turns out to be inconsistent. In this work, we investigate factors impacting the model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational biases in pre-trained multilingual models, with the downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for varying scripts (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of the performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance in semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language support claims of LLMs often mask deeper mismatches at the script or token level.
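As a rough operationalization, tokenization parity can be measured as a token-count ratio over parallel text in two language varieties: the closer to 1, the more evenly the tokenizer treats them. A minimal sketch (the paper's exact TP formula may differ):

```python
from transformers import AutoTokenizer

def tokenization_parity(tokenizer, texts_a, texts_b):
    """Token-count ratio over parallel texts in two varieties;
    1.0 means both sides need the same number of tokens."""
    n_a = sum(len(tokenizer.encode(t)) for t in texts_a)
    n_b = sum(len(tokenizer.encode(t)) for t in texts_b)
    return min(n_a, n_b) / max(n_a, n_b)

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tokenization_parity(tok,
                          ["Good morning, friends."],     # standard variety
                          ["Guete Morge, Fründe."]))      # dialectal variety
```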
[63] Responsible AI Technical Report
KT, :, Soonmin Bae, Wanjin Park, Jeongyeop Kim, Yunjin Park, Jungwon Yoon, Junhyung Moon, Myunggyo Oh, Wonhyuk Lee, Junseo Jang, Dongyoung Jung, Minwook Ju, Eunmi Kim, Sujin Kim, Youngchol Kim, Somin Lee, Wonyoung Lee, Minsung Noh, Hyoungjun Park, Eunyoung Shin
Main category: cs.CL
TL;DR: KT developed a Responsible AI assessment methodology and risk mitigation technologies including SafetyGuard tool to ensure AI service safety and regulatory compliance.
Details
Motivation: To address AI safety and reliability concerns by establishing regulatory compliance approaches and systematically managing AI risks from development to operation.
Method: Developed a Responsible AI assessment methodology based on AI risk taxonomy, analyzed Basic Act on AI implementation and global governance trends, and created practical tools including SafetyGuard for real-time harmful response blocking.
Result: Established a unique regulatory compliance approach, systematic risk identification methodology, and released proprietary Guardrail SafetyGuard tool for real-time AI risk mitigation.
Conclusion: The research outcomes provide valuable insights for organizations developing Responsible AI and enhance safety in the domestic AI development ecosystem.
Abstract: KT developed a Responsible AI (RAI) assessment methodology and risk mitigation technologies to ensure the safety and reliability of AI services. By analyzing the Basic Act on AI implementation and global AI governance trends, we established a unique approach for regulatory compliance and systematically identify and manage all potential risk factors from AI development to operation. We present a reliable assessment methodology that systematically verifies model safety and robustness based on KT’s AI risk taxonomy tailored to the domestic environment. We also provide practical tools for managing and mitigating identified AI risks. With the release of this report, we also release our proprietary guardrail, SafetyGuard, which blocks harmful responses from AI models in real-time, supporting the enhancement of safety in the domestic AI development ecosystem. We also believe these research outcomes provide valuable insights for organizations seeking to develop Responsible AI.
[64] From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors
Maggie Mi, Aline Villavicencio, Nafise Sadat Moosavi
Main category: cs.CL
TL;DR: Input-only method using token-level likelihood features to predict language model failures on idiomatic/figurative inputs, outperforming baselines across five challenging datasets.
Details
Motivation: Language models often misinterpret idiomatic, figurative, or context-sensitive inputs from the outset, leading to failures not due to flawed outputs but initial comprehension errors.
Method: Proposes input-only failure prediction using token-level likelihood features inspired by surprisal and Uniform Information Density hypothesis, capturing localized uncertainty in input comprehension without needing outputs or hidden activations.
Result: Method outperforms standard baselines across five linguistically challenging datasets. Span-localized features improve error detection for larger models, while smaller models benefit from global patterns.
Conclusion: Offers a lightweight, generalizable approach to pre-generation error prediction that requires no access to model outputs or hidden activations.
Abstract: Language models often struggle with idiomatic, figurative, or context-sensitive inputs, not because they produce flawed outputs, but because they misinterpret the input from the outset. We propose an input-only method for anticipating such failures using token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis. These features capture localized uncertainty in input comprehension and outperform standard baselines across five linguistically challenging datasets. We show that span-localized features improve error detection for larger models, while smaller models benefit from global patterns. Our method requires no access to outputs or hidden activations, offering a lightweight and generalizable approach to pre-generation error prediction.
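The features here are per-token surprisals of the input under the model, with no generation involved. A minimal sketch with a small causal LM; summary statistics (mean, max, span-level aggregates) would then feed a failure classifier, though the paper's exact feature set is not specified in the summary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_surprisals(model, tokenizer, text):
    """Per-token surprisal -log p(token | prefix) of the input itself."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)
    return -log_probs.gather(-1, targets).squeeze(-1)

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
s = token_surprisals(lm, tok, "He finally kicked the bucket.")
features = {"mean": s.mean().item(), "max": s.max().item()}
```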
[65] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu
Main category: cs.CL
TL;DR: TtT is a unified audio-text modeling framework that combines autoregressive text generation with non-autoregressive audio diffusion in a single Transformer architecture, addressing the asymmetry between text and audio token dependencies.
Details
Motivation: Existing multimodal models for speech-in speech-out systems require complex multi-stage training and uniformly apply autoregressive generation to both text and audio tokens, ignoring the fundamental asymmetry in their dependency structures where text tokens need causal ordering while audio tokens are driven by source-target dependencies.
Method: Proposes TtT framework that integrates AR text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM, allowing different generation strategies for text and audio tokens.
Result: The paper presents a unified approach that avoids complex multi-stage training pipelines and more accurately models the different dependency structures of text and audio tokens.
Conclusion: TtT provides an efficient and unified solution for multimodal speech-in speech-out systems by properly handling the asymmetric nature of text and audio token dependencies through combined autoregressive and non-autoregressive generation strategies.
Abstract: Recent advances in large language models have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-in speech-out conversational systems. However, existing multimodal models handling interleaved audio and text, such as MOSHI, require complex multi-stage training pipelines, incurring substantial computational costs. Moreover, these models uniformly apply autoregressive generation to both text and audio tokens, overlooking a fundamental asymmetry in their dependency structures: while text tokens exhibit strong target-target dependencies requiring causal ordering, audio tokens are predominantly driven by source-target dependencies, where audio outputs primarily condition on source text rather than preceding audio tokens. In this work, we propose TtT, a unified audio-text modeling framework that integrates AR text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM.
[66] Can Constructions “SCAN” Compositionality ?
Ganesh Katrapati, Manish Shrivastava
Main category: cs.CL
TL;DR: Unsupervised mining of pseudo-constructions from training data improves sequence-to-sequence models’ compositionality and systematic generalization on SCAN dataset without architectural changes.
Details
Motivation: Sequence-to-sequence models struggle with compositionality and systematic generalization due to failure to internalize conventionalized form-meaning pairings that enable productive recombination.
Method: Introduce an unsupervised procedure for mining pseudo-constructions (variable-slot templates) automatically extracted from training data, applied to the SCAN dataset.
Result: Large gains on out-of-distribution splits: 47.8% accuracy on ADD JUMP and 20.3% on AROUND RIGHT. Competitive performance with only 40% of original training data, demonstrating strong data efficiency.
Conclusion: Construction-aware preprocessing shows promise as an alternative to heavy architectural or training-regime interventions for improving systematic generalization.
Abstract: Sequence-to-sequence models struggle at compositionality and systematic generalisation even while they excel at many other tasks. We attribute this limitation to their failure to internalise constructions: conventionalised form-meaning pairings that license productive recombination. Building on these insights, we introduce an unsupervised procedure for mining pseudo-constructions: variable-slot templates automatically extracted from training data. When applied to the SCAN dataset, our method yields large gains on out-of-distribution splits: accuracy rises to 47.8% on ADD JUMP and to 20.3% on AROUND RIGHT without any architectural changes or additional supervision. The model also attains competitive performance with ~40% of the original training data, demonstrating strong data efficiency. Our findings highlight the promise of construction-aware preprocessing as an alternative to heavy architectural or training-regime interventions.
[67] OLaPh: Optimal Language Phonemizer
Johannes Wirth
Main category: cs.CL
TL;DR: OLaPh is a phonemization framework combining lexica, NLP techniques, and probabilistic scoring, enhanced by an LLM trained on its outputs for improved accuracy in German and English.
Details
Motivation: Traditional phonemization systems struggle with names, loanwords, abbreviations, and homographs, limiting their accuracy and generalization.
Method: Combines large lexica, multiple NLP techniques, compound resolution with probabilistic scoring, and trains an LLM on OLaPh-generated data.
Result: Shows improved accuracy over previous approaches in German and English, including on challenging datasets, with the LLM providing even stronger generalization.
Conclusion: The framework and LLM together improve phonemization consistency and provide a freely available resource for future research.
Abstract: Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
[68] Causal Understanding by LLMs: The Role of Uncertainty
Oscar Lithgow-Serrano, Vani Kanjirangat, Alessandro Antonucci
Main category: cs.CL
TL;DR: LLMs show near-random accuracy in causal relation classification despite pretraining exposure, suggesting failures stem from lack of structured causal representation rather than insufficient exposure.
Details
Motivation: To investigate whether LLMs' poor performance in causal relation classification arises from limited pretraining exposure or deeper representational gaps in causal understanding.
Method: Evaluated 7 models on 18K PubMed sentences using uncertainty-based evaluation with causal classification (4-way: direct/conditional/correlational/no-relationship) and verbatim memorization probing (original vs. paraphrase selection).
Result: Models showed identical accuracy on seen/unseen sentences, no memorization bias (24.8% original selection), flat output distributions with near-maximum entropy (1.35/1.39), and instruction-tuned models exhibited severe miscalibration (Qwen: 95% confidence but 32.8% accuracy).
Conclusion: Failures in causal understanding arise from lack of structured causal representation, not insufficient exposure to causal examples during pretraining.
Abstract: Recent papers show LLMs achieve near-random accuracy in causal relation classification, raising questions about whether such failures arise from limited pretraining exposure or deeper representational gaps. We investigate this under uncertainty-based evaluation, testing whether pretraining exposure to causal examples improves causal understanding, using >18K PubMed sentences – half from The Pile corpus, half post-2024 – across seven models (Pythia-1.4B/7B/12B, GPT-J-6B, Dolly-7B/12B, Qwen-7B). We analyze model behavior through: (i) causal classification, where the model identifies causal relationships in text, and (ii) verbatim memorization probing, where we assess whether the model prefers previously seen causal statements over their paraphrases. Models perform four-way classification (direct/conditional/correlational/no-relationship) and select between originals and their generated paraphrases. Results show almost identical accuracy on seen/unseen sentences (p > 0.05), no memorization bias (24.8% original selection), and an almost flat output distribution over the possible options, with entropy values near the maximum (1.35/1.39), confirming random guessing. Instruction-tuned models show severe miscalibration (Qwen: > 95% confidence, 32.8% accuracy, ECE=0.49). Conditional relations induce the highest entropy (+11% vs. direct). These findings suggest that failures in causal understanding arise from the lack of structured causal representation, rather than insufficient exposure to causal examples during pretraining.
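The reported entropies are easy to contextualize: for a four-way classification the maximum Shannon entropy is ln 4 ≈ 1.386 nats, so values of 1.35-1.39 mean the output distribution is essentially flat. A quick check:

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 1.386..., the 4-way maximum
print(entropy([0.28, 0.26, 0.24, 0.22]))  # ~1.382, still near-flat
```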
[69] Integrated Framework for LLM Evaluation with Answer Generation
Sujeong Lee, Hayoung Lee, Seongsoo Heo, Wonik Choi
Main category: cs.CL
TL;DR: SPEED is a novel evaluation framework that uses specialized functional experts for comprehensive descriptive analysis of LLM outputs, addressing limitations of traditional benchmark-based methods.
Details
Motivation: Traditional benchmark-based evaluation methods rely on fixed reference answers and fail to capture important qualitative aspects of generated responses, limiting their reliability for practical applications.
Method: SPEED employs an integrated framework with specialized functional experts that perform multi-dimensional descriptive analyses, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness, while actively incorporating expert feedback.
Result: Experimental results show SPEED achieves robust and consistent evaluation performance across diverse domains and datasets, with superior resource efficiency using relatively compact expert models compared to larger-scale evaluators.
Conclusion: SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies by providing more comprehensive and practical assessment capabilities.
Abstract: Reliable evaluation of large language models is essential to ensure their applicability in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, limiting their ability to capture important qualitative aspects of generated responses. To address these shortcomings, we propose an integrated evaluation framework called self-refining descriptive evaluation with expert-driven diagnostics (SPEED), which utilizes specialized functional experts to perform comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets. Additionally, by employing relatively compact expert models, SPEED demonstrates superior resource efficiency compared to larger-scale evaluators. These findings illustrate that SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies.
[70] Less is More: The Effectiveness of Compact Typological Language Representations
York Hay Ng, Phuong Hanh Hoang, En-Shiun Annie Lee
Main category: cs.CL
TL;DR: Proposes a pipeline to optimize URIEL+ typological feature space through feature selection and imputation to create compact, interpretable representations that improve linguistic distance metrics and multilingual NLP performance.
Details
Motivation: High dimensionality and sparsity in linguistic feature datasets like URIEL+ limit the effectiveness of distance metrics, especially for low-resource languages.
Method: Combines feature selection and imputation techniques to optimize the URIEL+ typological feature space, producing reduced-size yet interpretable typological representations.
Result: The optimized feature subsets demonstrate improved linguistic distance alignment and better performance in downstream multilingual NLP applications.
Conclusion: Reduced-size representations of language typology can yield more informative distance metrics and enhance performance in multilingual NLP tasks.
Abstract: Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that reduced-size representations of language typology can yield more informative distance metrics and improve performance in multilingual NLP applications.
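As a concrete picture of the select-and-impute pipeline sketched above, the snippet below runs on a synthetic sparse feature matrix. The ordering (impute first, then drop low-variance features), the thresholds, and the toy data are illustrative assumptions; URIEL+ itself is not loaded here.

```python
# Minimal select-and-impute sketch on a synthetic typological matrix (assumptions noted above).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
# languages x binary typological features, with many missing entries (NaN).
X = rng.integers(0, 2, size=(40, 200)).astype(float)
X[rng.random(X.shape) < 0.4] = np.nan   # ~40% sparsity, typical for low-resource languages

# Step 1: impute missing values from typologically similar languages.
X_filled = KNNImputer(n_neighbors=5).fit_transform(X)

# Step 2: drop near-constant features that add dimensionality but little signal.
X_compact = VarianceThreshold(threshold=0.05).fit_transform(X_filled)
print(X.shape, "->", X_compact.shape)

# Distances over the compact space can then drive, e.g., transfer-language selection.
dist = squareform(pdist(X_compact, metric="cosine"))
```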
[71] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wud, Zichen Wang
Main category: cs.CL
TL;DR: RLAG (Reinforcement Learning from Augmented Generation) addresses knowledge gaps in LLMs for domain-specific tasks by iteratively optimizing models through reward-based generation sampling, outperforming traditional CPT and SFT methods.
Details
Motivation: LLMs struggle with domain-specific tasks due to knowledge scarcity and temporal lag in training data. Existing methods like CPT and SFT have limitations in prioritizing critical knowledge and developing coherent knowledge structures for complex reasoning.
Method: Proposes RLAG which iteratively cycles between sampling generations and optimizing models through calculated rewards. Uses highest log probability outputs with three tailored reward metrics to embed critical domain knowledge.
Result: Experimental results across medical, legal, astronomy, and current events datasets show RLAG significantly outperforms baseline approaches in both answer accuracy and explanation rationality.
Conclusion: RLAG effectively addresses knowledge gaps in domain-specific applications by embedding contextually coherent domain knowledge through reinforcement learning from augmented generation.
Abstract: Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at https://github.com/ChaojunNie/RLAG.
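The sample-reward-optimize loop at the heart of RLAG can be pictured schematically. Everything below (the toy candidate generator, the three reward stubs, returning a scalar instead of applying a policy-gradient update) is a hypothetical stand-in, not the released code at the repository above.

```python
# Schematic RLAG-style step: sample, keep the highest-logprob generation, score it.
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    mean_logprob: float

def generate_candidates(prompt, n=4):
    # Stand-in for sampling n generations from the policy model.
    return [Candidate(f"{prompt} :: answer v{i}", random.uniform(-2.0, -0.1))
            for i in range(n)]

# Three tailored reward terms (names are placeholders for the paper's metrics).
def reward_accuracy(text, ref):
    return 1.0 if ref in text else 0.0

def reward_rationality(text):
    return min(len(text.split()) / 50.0, 1.0)

def reward_coverage(text, ref):
    return 0.5  # fixed stub

def rlag_step(prompt, ref):
    candidates = generate_candidates(prompt)
    # Keep the generation with the highest log probability, per the abstract...
    best = max(candidates, key=lambda c: c.mean_logprob)
    # ...then combine the tailored rewards that would drive the policy update.
    reward = (reward_accuracy(best.text, ref)
              + reward_rationality(best.text)
              + reward_coverage(best.text, ref))
    return best, reward  # a real implementation backpropagates a policy loss here

best, reward = rlag_step("What causes the aurora borealis?", "solar wind")
print(best.text, reward)
```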
[72] Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in Persian
Ghazal Kalhor, Behnam Bahrak
Main category: cs.CL
TL;DR: This paper proposes a template-based probing methodology to uncover gender stereotypes in multilingual LLMs, focusing on Persian as a low-resource language, and introduces the Domain-Specific Gender Skew Index (DS-GSI) metric to quantify bias.
Details
Motivation: Multilingual LLMs are widely used but gender bias in low-resource languages remains understudied, particularly for languages like Persian with distinct linguistic features, creating a need to prevent representational harm.
Method: Template-based probing methodology validated against real-world data, using DS-GSI metric to quantify gender bias deviations. Evaluated four models (GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, Qwen QwQ 32B) across four semantic domains.
Result: All models exhibited gender stereotypes, with greater disparities in Persian than English across all domains. Sports domain showed the most rigid gender biases.
Conclusion: The study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.
Abstract: Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.
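The abstract defines DS-GSI only as a quantification of deviation from gender parity, so the following is one plausible formalization, an assumption rather than the paper's exact formula.

```python
# One plausible reading of DS-GSI (assumed, not the paper's definition):
# absolute gap between male and female completion rates within a domain.
from collections import Counter

def ds_gsi(assignments):
    """assignments: 'male'/'female' labels a model produced for gender-neutral
    templates in one semantic domain. 0.0 = parity, 1.0 = one gender only."""
    counts = Counter(assignments)
    total = sum(counts.values())
    male = counts.get("male", 0) / total
    female = counts.get("female", 0) / total
    return abs(male - female)

# Example: a sports-domain probe where 9 of 10 completions pick 'male'.
print(ds_gsi(["male"] * 9 + ["female"]))   # 0.8 -> strongly skewed
```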
[73] Thinking Augmented Pre-training
Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Main category: cs.CL
TL;DR: TPT improves LLM training data efficiency by 3x through augmenting text with automatically generated thinking trajectories, enhancing learnability of complex tokens via step-by-step reasoning.
Details
Motivation: Address the limitation of high-quality data scarcity and the difficulty of learning complex tokens with fixed model capacity, as compute for pre-training grows rapidly while quality data remains limited.
Method: Thinking augmented Pre-Training (TPT) - a universal methodology that augments text data with automatically generated thinking trajectories, increasing training volume and making high-quality tokens more learnable through decomposition.
Result: Substantial performance improvements across various model sizes and families; 3x data efficiency improvement; 10%+ performance gain on challenging reasoning benchmarks for 3B parameter models; validated across diverse training configurations up to 100B tokens.
Conclusion: TPT is an effective and scalable approach that significantly enhances LLM training data efficiency and reasoning capabilities through thinking trajectory augmentation.
Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.
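The data-construction step TPT describes, pairing each document with a generated thinking trajectory, can be sketched in a few lines. generate_thinking below is a hypothetical stand-in for the trajectory-generating LLM call, and the prompt wording is an assumption.

```python
# Sketch of thinking-augmented sample construction (generator is a stub).
def generate_thinking(document: str) -> str:
    # In practice this would call an LLM, e.g. prompting it to reason
    # step by step about why the text says what it says.
    return f"[thinking] Step-by-step rationale for: {document[:60]}..."

def augment_corpus(documents):
    """Yield samples where a thinking trajectory accompanies the text, so
    hard-to-learn tokens are decomposed into intermediate reasoning steps."""
    for doc in documents:
        yield generate_thinking(doc) + "\n" + doc   # augmented pre-training sample

corpus = ["The integral of 1/x is ln|x| + C because differentiation inverts it."]
for sample in augment_corpus(corpus):
    print(sample)
```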
[74] Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs
Parker Glenn, Alfy Samuel, Daben Liu
Main category: cs.CL
TL;DR: This paper proposes an efficient solution for integrating LLM-powered operators in SQL-like declarative query languages, addressing the challenge of ensuring generated outputs align with database types and contents without performance bottlenecks.
Details
Motivation: Current approaches use multiple LLM-based post-processing calls to ensure alignment between LLM-generated outputs and database values, which creates performance bottlenecks. The authors aim to enable efficient execution of LLM functions within query languages while maintaining type safety and database alignment.
Method: The authors conducted a study on various-sized open-source language models’ ability to parse and execute functions within SQL-based query languages. They then proposed a solution to enforce well-typedness of LLM functions, showing that small language models can effectively function as executors over hybrid data sources.
Result: The proposed solution demonstrated 7% accuracy improvement on a multi-hop question answering dataset with 53% improvement in latency over comparable solutions.
Conclusion: Small language models can excel as function executors in declarative query languages, and the proposed efficient type enforcement solution significantly improves both accuracy and latency compared to existing approaches for integrating LLM operators in database query languages.
Abstract: Integrating LLM-powered operators in declarative query languages allows for the combination of cheap and interpretable functions with powerful, generalizable language model reasoning. However, in order to benefit from the optimized execution of a database query language like SQL, generated outputs must align with the rules enforced by both type checkers and database contents. Current approaches address this challenge with orchestrations consisting of many LLM-based post-processing calls to ensure alignment between generated outputs and database values, introducing performance bottlenecks. We perform a study on the ability of various-sized open-source language models to both parse and execute functions within a query language based on SQL, showing that small language models can excel as function executors over hybrid data sources. Then, we propose an efficient solution to enforce the well-typedness of LLM functions, demonstrating 7% accuracy improvement on a multi-hop question answering dataset with 53% improvement in latency over comparable solutions. We make our implementation available at https://github.com/parkervg/blendsql
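The proposed type enforcement can be pictured as a single validation wrapper around the LLM call, replacing chains of post-processing repair calls. The sketch below is schematic, with llm_map as a stub; it is not the BlendSQL implementation linked above.

```python
# Schematic well-typedness check for an LLM function in a SQL-like pipeline.
import datetime

def llm_map(value: str, instruction: str) -> str:
    return "2021-07-04"   # stub: imagine the LLM normalizing a date string

def typed_llm_function(value, instruction, column_type, allowed=None):
    out = llm_map(value, instruction)
    # Type check: coerce to the target column's type or fail fast (no extra LLM call).
    if column_type == "date":
        out = datetime.date.fromisoformat(out)
    elif column_type == "int":
        out = int(out)
    # Content check: constrain outputs to values actually present in the database.
    if allowed is not None and out not in allowed:
        raise ValueError(f"{out!r} not in column domain")
    return out

print(typed_llm_function("July 4th, 2021", "normalize to ISO date", "date"))
```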
[75] Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks
Hailay Kidu Teklehaymanot, Gebrearegawi Gidey, Wolfgang Nejdl
Main category: cs.CL
TL;DR: This paper proposes transfer learning techniques with custom tokenization to improve Neural Machine Translation for low-resource languages like Tigrinya, achieving significant performance gains over baseline methods.
Details
Motivation: Low-resource languages like Tigrinya face challenges in NMT due to limited corpora, inadequate tokenization strategies, and lack of standardized evaluation benchmarks. The paper aims to bridge this performance gap for morphologically rich, underrepresented languages.
Method: The approach integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning using multilingual pretrained models. A high-quality human-aligned English-Tigrinya evaluation dataset is constructed for rigorous assessment.
Result: Transfer learning with custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Statistical significance is confirmed using Bonferroni correction.
Conclusion: The study demonstrates the importance of linguistically aware modeling and reproducible benchmarks for improving translation quality in low-resource language scenarios. Error analysis informs targeted refinements for future work.
Abstract: Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at https://github.com/hailaykidu/MachineT_TigEng and https://huggingface.co/Hailay/MachineT_TigEng
[76] Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models
Yu Wang, Leyi Lao, Langchu Huang, Gabriel Skantze, Yang Xu, Hendrik Buschmeier
Main category: cs.CL
TL;DR: This paper studies how fine-tuning strategies can help transformer-based language models better represent backchannels and fillers in dialogue systems, showing improved semantic distinction and more human-like language generation.
Details
Motivation: Backchannels and fillers are important linguistic expressions in dialogue but are under-represented in modern transformer-based language models. The research aims to improve LMs’ ability to handle these conversational elements.
Method: Three fine-tuning strategies applied to language models trained on three dialogue corpora in English and Japanese. Used clustering analysis and natural language generation metrics to evaluate the learnt representations.
Result: Fine-tuned models showed increased silhouette scores in clustering analysis, indicating better semantic distinction of backchannels and fillers. NLG metrics confirmed generated utterances more closely resembled human-produced language.
Conclusion: Fine-tuning enables language models to better distinguish nuanced semantic variations in backchannels and fillers, suggesting potential for transforming general LMs into more capable conversational LMs that produce more human-like language.
Abstract: Backchannels and fillers are important linguistic expressions in dialogue, but are under-represented in modern transformer-based language models (LMs). Our work studies the representation of them in language models using three fine-tuning strategies. The models are trained on three dialogue corpora in English and Japanese, where backchannels and fillers are preserved and annotated, to investigate how fine-tuning can help LMs learn their representations. We first apply clustering analysis to the learnt representation of backchannels and fillers, and have found increased silhouette scores in representations from fine-tuned models, which suggests that fine-tuning enables LMs to distinguish the nuanced semantic variation in different backchannel and filler use. We also use natural language generation (NLG) metrics to confirm that the utterances generated by fine-tuned language models resemble human-produced utterances more closely. Our findings suggest the potentials of transforming general LMs into conversational LMs that are more capable of producing human-like languages adequately.
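The clustering analysis reported here is straightforward to reproduce in outline: cluster the learnt representations and compare silhouette scores. The sketch below substitutes random vectors for real hidden states, so only the procedure, not the numbers, carries over.

```python
# Silhouette-based separation check on stand-in embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Pretend embeddings: backchannels ("uh-huh", "yeah") vs. fillers ("uh", "um").
backchannels = rng.normal(loc=0.0, scale=0.5, size=(50, 16))
fillers = rng.normal(loc=2.0, scale=0.5, size=(50, 16))
X = np.vstack([backchannels, fillers])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Higher silhouette -> clearer semantic separation, the effect the paper
# observed after fine-tuning.
print(round(silhouette_score(X, labels), 3))
```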
[77] Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage
Zipeng Ling, Yuehao Tang, Chen Huang, Shuliang Liu, Gaoyang Jiang, Shenghong Fu, Junqi Yang, Yao Wan, Jiawan Zhang, Kejia Huang, Xuming Hu
Main category: cs.CL
TL;DR: The paper identifies the ‘Instruction Boundary’ vulnerability in LLMs where biased or incomplete prompts lead to unreliable reasoning, introduces BiasDetector framework to measure biases from three instruction types, and finds substantial biases persist despite high headline accuracy.
Details
Motivation: LLM reasoning is powerful but limitations from prompt design remain underexplored. Users may unintentionally supply biased or incomplete prompts, misleading LLMs and creating reliability risks.
Method: Distilled the Instruction Boundary into eight facets and introduced BiasDetector framework to measure biases arising from complete, redundant, and insufficient instruction types. Evaluated several mainstream LLMs.
Result: Despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. LLM reasoning reliability can still be significantly improved.
Conclusion: Findings underscore the need for developers to tackle biases and for users to craft prompts carefully. The paper analyzes practical impact and outlines mitigation strategies.
Abstract: Large-language-model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, their limitations - especially those stemming from prompt design - remain underexplored. Because users may supply biased or incomplete prompts - often unintentionally - LLMs can be misled, undermining reliability and creating risks. We refer to this vulnerability as the Instruction Boundary. To investigate the phenomenon, we distill it into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that LLM reasoning reliability can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.
[78] Feeding Two Birds or Favoring One? Adequacy-Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation
Behzad Shayegh, Jan-Thorsten Peter, David Vilar, Tobias Domhan, Juraj Juraska, Markus Freitag, Lili Mou
Main category: cs.CL
TL;DR: This paper investigates the tradeoff between adequacy and fluency in machine translation evaluation, showing current metrics favor adequacy and that WMT meta-evaluation has a bias toward adequacy-oriented metrics due to system composition.
Details
Motivation: To understand and address the bias in machine translation evaluation where current metrics and meta-evaluation frameworks favor adequacy over fluency, potentially leading to unfair metric rankings.
Method: Analyzed the tradeoff at both evaluation and meta-evaluation levels, and proposed a method to synthesize translation systems in meta-evaluation to control for the bias.
Result: Found that current metrics correlate more strongly with adequacy than fluency, and WMT meta-evaluation favors adequacy-oriented metrics due to the composition of systems in datasets.
Conclusion: Highlights the importance of understanding the adequacy-fluency tradeoff in meta-evaluation and its impact on metric rankings, suggesting the need for balanced evaluation frameworks.
Abstract: We investigate the tradeoff between adequacy and fluency in machine translation. We show the severity of this tradeoff at the evaluation level and analyze where popular metrics fall within it. Essentially, current metrics generally lean toward adequacy, meaning that their scores correlate more strongly with the adequacy of translations than with fluency. More importantly, we find that this tradeoff also persists at the meta-evaluation level, and that the standard WMT meta-evaluation favors adequacy-oriented metrics over fluency-oriented ones. We show that this bias is partially attributed to the composition of the systems included in the meta-evaluation datasets. To control this bias, we propose a method that synthesizes translation systems in meta-evaluation. Our findings highlight the importance of understanding this tradeoff in meta-evaluation and its impact on metric rankings.
[79] Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression, mBERT, and XLM-RoBERTa with Active Learning
T. O. Abiola, K. D. Abiodun, O. E. Olumide, O. O. Adebanji, O. Hiram Calvo, Grigori Sidorov
Main category: cs.CL
TL;DR: Multilingual framework for hope speech detection using active learning and transformer models (mBERT, XLM-RoBERTa) that achieves strong performance across English, Spanish, German, and Urdu datasets.
Details
Motivation: Hope speech detection is challenging in multilingual and low-resource settings, but plays a vital role in promoting positive online discourse.
Method: Active learning approach combined with transformer-based models (mBERT and XLM-RoBERTa) tested on datasets in four languages with benchmark test sets from recent shared tasks.
Result: Transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving highest overall accuracy. Active learning maintained strong performance even with small annotated datasets.
Conclusion: Combining multilingual transformers with data-efficient training strategies is effective for hope speech detection.
Abstract: Hope speech language that fosters encouragement and optimism plays a vital role in promoting positive discourse online. However, its detection remains challenging, especially in multilingual and low-resource settings. This paper presents a multilingual framework for hope speech detection using an active learning approach and transformer-based models, including mBERT and XLM-RoBERTa. Experiments were conducted on datasets in English, Spanish, German, and Urdu, including benchmark test sets from recent shared tasks. Our results show that transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy. Furthermore, our active learning strategy maintained strong performance even with small annotated datasets. This study highlights the effectiveness of combining multilingual transformers with data-efficient training strategies for hope speech detection.
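A minimal pool-based active learning loop of the kind the paper combines with its transformer models might look like the following. The least-confidence query strategy, the linear stand-in classifier, and the batch size are illustrative assumptions.

```python
# Pool-based active learning sketch with least-confidence sampling.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 8))
y_pool = (X_pool[:, 0] > 0).astype(int)       # stand-in hope/non-hope labels
X_lab, y_lab = X_pool[:20], y_pool[:20]       # small seed annotation set
X_pool, y_pool = X_pool[20:], y_pool[20:]

for _ in range(3):                            # a few annotation rounds
    clf = LogisticRegression().fit(X_lab, y_lab)
    conf = clf.predict_proba(X_pool).max(axis=1)
    picked = np.argsort(conf)[:16]            # least-confident pool items
    X_lab = np.vstack([X_lab, X_pool[picked]])
    y_lab = np.concatenate([y_lab, y_pool[picked]])   # oracle annotation
    keep = np.ones(len(X_pool), dtype=bool)
    keep[picked] = False
    X_pool, y_pool = X_pool[keep], y_pool[keep]       # remove newly labeled items

print(clf.score(X_pool, y_pool))  # accuracy after data-efficient training
```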
[80] SIM-CoT: Supervised Implicit Chain-of-Thought
Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Main category: cs.CL
TL;DR: SIM-CoT addresses instability in implicit Chain-of-Thought methods by adding step-level supervision during training, improving performance and scalability while maintaining token efficiency.
Details
Motivation: Implicit CoT methods are token-efficient but suffer from performance gaps due to training instability when scaling computational budget, caused by latent representations becoming homogeneous without proper step-level supervision.
Method: Proposes SIM-CoT, a plug-and-play training module that uses an auxiliary decoder to align implicit tokens with explicit reasoning steps, providing step-level supervision to stabilize latent representations. The decoder is removed during inference.
Result: SIM-CoT boosts baselines by +8.2% on GPT-2 and +3.0% on LLaMA-3.1 8B, surpasses explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, and significantly improves in-domain accuracy and out-of-domain stability.
Conclusion: SIM-CoT effectively stabilizes implicit CoT training, closes the performance gap with explicit methods, maintains computational efficiency, and provides interpretability through step-level visualization.
Abstract: Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.
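The training signal SIM-CoT adds, an auxiliary decoder tying each implicit token to its explicit reasoning step, can be sketched in a few lines of PyTorch. The linear decoder and single-token step targets below are deliberate simplifications, not the paper's architecture; the key point is that the step-level loss back-propagates into the latent space.

```python
# Simplified step-level supervision for implicit CoT (assumptions noted above).
import torch
import torch.nn as nn

vocab, hidden, n_latent = 1000, 64, 4
aux_decoder = nn.Linear(hidden, vocab)   # auxiliary decoder, dropped at inference

# Implicit reasoning tokens from the base model (toy random stand-ins).
latents = torch.randn(2, n_latent, hidden, requires_grad=True)
# One aligned target token per latent, summarizing its explicit reasoning step.
targets = torch.randint(0, vocab, (2, n_latent))

logits = aux_decoder(latents)                           # (B, n_latent, vocab)
step_loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1))     # step-level supervision
step_loss.backward()   # gradients reach the latents, keeping them semantically distinct
print(float(step_loss))
```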
[81] Morphological Synthesizer for Ge’ez Language: Addressing Morphological Complexity and Resource Limitations
Gebrearegawi Gebremariam, Hailay Teklehaymanot, Gebregewergs Mezgebe
Main category: cs.CL
TL;DR: This paper presents a rule-based morphological synthesizer for Ge’ez language that generates surface words from root words, achieving 97.4% accuracy on 1,102 test verbs.
Details
Motivation: Ge’ez is a significant ancient Semitic language with complex morphology, but lacks usable NLP tools due to scarcity of annotated linguistic resources. The language’s importance for Ethiopian/Eritrean cultural heritage and identity documentation necessitates computational tools.
Method: Developed a rule-based morphological synthesizer that generates surface words from root words according to Ge’ez morphological structures. Tested and evaluated using 1,102 sample verbs representing all verb morphological structures.
Result: The system achieved 97.4% performance, outperforming the baseline model and demonstrating effectiveness in handling Ge’ez morphological complexity.
Conclusion: The successful development of this rule-based synthesizer shows promise for Ge’ez NLP, suggesting future work should build comprehensive systems considering the language’s morphological variations.
Abstract: Ge’ez is an ancient Semitic language renowned for its unique alphabet. It serves as the script for numerous languages, including Tigrinya and Amharic, and played a pivotal role in Ethiopia’s cultural and religious development during the Aksumite kingdom era. Ge’ez remains significant as a liturgical language in Ethiopia and Eritrea, with much of the national identity documentation recorded in Ge’ez. These written materials are invaluable primary sources for studying Ethiopian and Eritrean philosophy, creativity, knowledge, and civilization. Ge’ez has a complex morphological structure with rich inflectional and derivational morphology, and no usable NLP tools have been developed and published until now due to the scarcity of annotated linguistic data, corpora, labeled datasets, and lexicons. Therefore, we propose a rule-based Ge’ez morphological synthesizer to generate surface words from root words according to the morphological structures of the language. We used 1,102 sample verbs, representing all verb morphological structures, to test and evaluate the system. The system achieves a performance of 97.4%, outperforming the baseline model and suggesting that future work should build a comprehensive system considering morphological variations of the language.
Keywords: Ge’ez, NLP, morphology, morphological synthesizer, rule-based
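Root-and-pattern synthesis of the kind the paper implements can be illustrated with a toy interdigitation function: a consonantal root is slotted into a vocalic template, then affixes apply. The root, templates, and suffix below are demonstration placeholders, not entries from the paper's Ge'ez rule set.

```python
# Toy Semitic-style root-and-pattern synthesis (placeholder rules, not Ge'ez data).
def synthesize(root: str, template: str, suffix: str = "") -> str:
    """Slot root consonants (C1..C3) into a CV template, then attach a suffix."""
    out = template
    for i, consonant in enumerate(root, start=1):
        out = out.replace(f"C{i}", consonant)
    return out + suffix

root = "qtl"   # a classic Semitic demonstration root
print(synthesize(root, "C1aC2aC3a"))          # -> 'qatala' (perfective-style pattern)
print(synthesize(root, "yeC1aC2C3"))          # -> 'yeqatl' (imperfective-style pattern)
print(synthesize(root, "C1aC2aC3a", "-ku"))   # person/number marked by a suffix
```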
[82] EmbeddingGemma: Powerful and Lightweight Text Representations
Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
Main category: cs.CL
TL;DR: EmbeddingGemma is a lightweight 300M parameter text embedding model that achieves state-of-the-art performance on MTEB benchmarks, outperforming larger models through innovative training techniques including encoder-decoder initialization and geometric embedding distillation.
Details
Motivation: To create a highly efficient text embedding model that provides exceptional performance-to-cost ratio for low-latency applications like on-device use, while maintaining state-of-the-art results across multilingual, English, and code domains.
Method: Uses encoder-decoder initialization and geometric embedding distillation to capture knowledge from larger models, incorporates spread-out regularizer for robustness, and merges checkpoints from varied optimized mixtures for generalizability.
Result: Achieves state-of-the-art results on MTEB benchmarks, outperforms prior top models with fewer than 500M parameters, provides performance comparable to models double its size, and maintains lead even when quantized or truncated.
Conclusion: EmbeddingGemma offers exceptional efficiency and performance, making it well-suited for low-latency, high-throughput applications while being released to the community to promote further research.
Abstract: We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
[83] Language Models that Think, Chat Better
Adithya Bhaskar, Xi Ye, Danqi Chen
Main category: cs.CL
TL;DR: RLMT (RL with Model-rewarded Thinking) extends RLVR to open-ended tasks by optimizing language models to generate Chain-of-Thought reasoning before responses using preference-based rewards, achieving state-of-the-art performance on chat benchmarks without requiring large-scale SFT.
Details
Motivation: RLVR works well for verifiable domains like math and code but has limited generalization for open-ended tasks where humans routinely reason. The paper aims to extend the RLVR paradigm to general-purpose chat capabilities.
Method: RLMT requires LMs to generate long Chain-of-Thought reasoning before responses and optimizes them with online RL against preference-based reward models. Tested across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B using DPO, PPO, and GRPO algorithms.
Result: Consistent outperformance of standard RLHF pipelines with 3-7 point gains on chat benchmarks (AlpacaEval2, WildBench, ArenaHardV2) and 1-3 point improvements on creative writing and general knowledge. Best 8B model surpasses GPT-4o in chat/creative writing and rivals Claude-3.7-Sonnet.
Conclusion: RLMT rethinks post-training pipelines, demonstrating that thinking-based optimization with minimal data (7K prompts) can outperform complex multi-stage pipelines with 25M+ examples, calling for broader understanding and employment of thinking in LM training.
Abstract: Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks – such as writing outline essays or making meal plans – where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before responding, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call upon future work to understand and employ thinking more broadly.
[84] TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot
Kaiqi Zhang, Shuai Yuan, Honghan Zhao
Main category: cs.CL
TL;DR: TALEC is a model-based evaluation method for LLMs that uses in-context learning to teach judge models custom evaluation criteria, achieving over 80% correlation with human judgments.
Details
Motivation: Current LLM evaluation in business scenarios relies on expensive manual methods and struggles to meet both general standards and specific customer/business security requirements simultaneously.
Method: Proposes TALEC using in-context learning to teach judge models custom criteria, combines zero-shot and few-shot approaches, and introduces a prompt paradigm for iterative shot adjustment. Compares fine-tuning vs ICL.
Result: TALEC achieves over 80% correlation with human judgments, outperforming inter-human correlation in some tasks. Fine-tuning can be replaced by ICL.
Conclusion: TALEC provides an effective automated evaluation method that accurately reflects human preferences and addresses the limitations of manual evaluation in business scenarios.
Abstract: With the rapid development of large language models (LLM), the evaluation of LLMs becomes increasingly important. Measuring text generation tasks such as summarization and article creation is very difficult. Especially in specific application domains (e.g., to-business or to-customer service), in-house evaluation criteria have to meet not only general standards (correctness, helpfulness and creativity, etc.) but also specific needs of customers and business security requirements at the same time, making the evaluation more difficult. So far, the evaluation of LLMs in business scenarios has mainly relied on manual evaluation, which is expensive and time-consuming. In this paper, we propose a model-based evaluation method: TALEC, which allows users to flexibly set their own evaluation criteria, and uses in-context learning (ICL) to teach the judge model these in-house criteria. In addition, we try combining zero-shot and few-shot to make the judge model focus on more information. We also propose a prompt paradigm and an engineering approach to adjust and iterate the shots, helping the judge model to better understand the complex criteria. We then compare fine-tuning with ICL, finding that fine-tuning can be replaced by ICL. TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments, outperforming even the inter-human correlation in some tasks. The code is released at https://github.com/zlkqz/auto_eval
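TALEC's combination of zero-shot criteria statements with few-shot scored examples can be pictured as a single prompt builder. The criterion wording and shot format below are illustrative assumptions, not the paper's prompt paradigm verbatim.

```python
# Sketch of a zero-shot-plus-few-shot judge prompt (format is assumed).
def build_judge_prompt(criteria, shots, candidate):
    lines = ["You are an evaluator. Score the answer from 1-5 on each criterion."]
    lines += [f"Criterion: {c}" for c in criteria]            # zero-shot part
    for example, score, rationale in shots:                   # few-shot part
        lines.append(f"Example answer: {example}\nScore: {score} ({rationale})")
    lines.append(f"Answer to evaluate: {candidate}\nScore:")
    return "\n".join(lines)

prompt = build_judge_prompt(
    criteria=["No leakage of client-confidential details",
              "Tone appropriate for business correspondence"],
    shots=[("Sure, here's the client's account number...", 1,
            "violates confidentiality")],
    candidate="I can help you draft that email without sharing account data.",
)
print(prompt)
```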
[85] Context-Masked Meta-Prompting for Privacy-Preserving LLM Adaptation in Finance
Sayash Raaj Hiraou
Main category: cs.CL
TL;DR: Iterative meta-prompting methodology for privacy-preserving LLM optimization in financial applications, achieving 103.87% ROUGE-L F1 improvement for question answering.
Details
Motivation: Address privacy preservation and regulatory compliance needs for LLMs in sensitive financial domains without exposing proprietary/confidential context to the model.
Method: Iterative meta-prompting with novel regeneration process involving feeder and propagation methods to optimize hard prompts while maintaining privacy.
Result: Significant improvements in prompt efficacy with 103.87% improvement in ROUGE-L F1 for question answering on financial task proxies using GPT-3.5 Turbo.
Conclusion: Practical, low-cost strategy for adapting LLMs to financial applications while upholding privacy and auditability standards in generative AI.
Abstract: The increasing reliance on Large Language Models (LLMs) in sensitive domains like finance necessitates robust methods for privacy preservation and regulatory compliance. This paper presents an iterative meta-prompting methodology designed to optimise hard prompts without exposing proprietary or confidential context to the LLM. Through a novel regeneration process involving feeder and propagation methods, we demonstrate significant improvements in prompt efficacy. Evaluated on public datasets serving as proxies for financial tasks such as SQuAD for extractive financial Q&A, CNN/DailyMail for news summarisation, and SAMSum for client interaction summarisation, our approach, utilising GPT-3.5 Turbo, achieved a 103.87% improvement in ROUGE-L F1 for question answering. This work highlights a practical, low-cost strategy for adapting LLMs to financial applications while upholding critical privacy and auditability standards, offering a compelling case for its relevance in the evolving landscape of generative AI in finance.
[86] Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation
Hui Yi Leong, Yi Fan Gao, Ji Shuai, Yang Zhang, Uktu Pamuksuz
Main category: cs.CL
TL;DR: MediGen is a fine-tuned LLM that automates medical report generation from dialogues, achieving 58% ROUGE and 72% BERTScore-F1, potentially reducing physician administrative burden.
Details
Motivation: Physicians spend 2 hours on administrative tasks for every 1 hour of patient care, leading to burnout and inefficiencies in healthcare delivery.
Method: Fine-tuned LLaMA3-8B model using state-of-the-art methodologies to generate medical reports from clinical dialogues.
Result: The model achieved a ROUGE score of 58% and BERTScore-F1 of 72%, demonstrating high accuracy in medical report generation.
Conclusion: MediGen can significantly reduce administrative workload on physicians, improving healthcare efficiency and physician well-being.
Abstract: Scientific research indicates that for every hour spent in direct patient care, physicians spend nearly two additional hours on administrative tasks, particularly on electronic health records (EHRs) and desk work. This excessive administrative burden not only reduces the time available for patient care but also contributes to physician burnout and inefficiencies in healthcare delivery. To address these challenges, this study introduces MediGen, a fine-tuned large language model (LLM) designed to automate the generation of medical reports from medical dialogues. By leveraging state-of-the-art methodologies for fine-tuning open-source pretrained models, including LLaMA3-8B, MediGen achieves high accuracy in transcribing and summarizing clinical interactions. The fine-tuned LLaMA3-8B model demonstrated promising results, achieving a ROUGE score of 58% and a BERTScore-F1 of 72%, indicating its effectiveness in generating accurate and clinically relevant medical reports. These findings suggest that MediGen has the potential to significantly reduce the administrative workload on physicians, improving both healthcare efficiency and physician well-being.
[87] Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
Sergey Berezin, Reza Farahbakhsh, Noel Crespi
Main category: cs.CL
TL;DR: The paper introduces ToxASCII, a novel adversarial attack method that uses ASCII art to bypass toxicity detection models, achieving perfect attack success rates across state-of-the-art systems.
Details
Motivation: Current toxicity detection models fail to properly interpret spatially structured text like ASCII art, creating a significant vulnerability that can be exploited to bypass content moderation systems.
Method: The authors propose ToxASCII, a benchmark that uses ASCII art to visually obfuscate toxic content, testing the robustness of various toxicity detection models against these spatially structured text attacks.
Result: The attacks achieved a perfect Attack Success Rate (ASR) across diverse state-of-the-art large language models and dedicated moderation tools, demonstrating severe vulnerabilities in current text-only moderation systems.
Conclusion: The research reveals critical weaknesses in existing toxicity detection systems and highlights the need for more robust moderation approaches that can handle visually obfuscated content like ASCII art.
Abstract: We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models’ failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
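The attack surface is easy to demonstrate with an off-the-shelf FIGlet renderer; the sketch below uses the pyfiglet package for convenience, which need not match the fonts or attack construction the paper uses. Once a word is spread across a character grid, a text-only classifier no longer sees the original token.

```python
# ASCII-art obfuscation demo (pyfiglet is a convenience choice, not the paper's tooling).
from pyfiglet import figlet_format

def to_ascii_art(word: str, font: str = "standard") -> str:
    return figlet_format(word, font=font)

payload = to_ascii_art("example")
print(payload)
# A tokenizer applied to this grid sees slashes, pipes, and underscores rather
# than the original word, which is why string- and embedding-based toxicity
# filters fail to flag such inputs.
```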
[88] UNComp: Can Matrix Entropy Uncover Sparsity? – A Compressor Design from an Uncertainty-Aware Perspective
Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Min Yang, Lingpeng Kong, Ngai Wong
Main category: cs.CL
TL;DR: UNComp is an uncertainty-aware framework that uses truncated matrix entropy to identify low-information areas in LLM KV caches, enabling adaptive compression that reduces cache size to 4.74% of original while improving performance.
Details
Motivation: Current KV cache compression methods neglect the structured sparsity in LLM hidden states and KV cache relationships, failing to leverage uncertainty as an indicator of sparsity for optimized compression.
Method: Proposes UNComp framework that uses truncated matrix entropy to measure uncertainty and identify sparsity patterns, enabling dynamic, adaptive compression based on uncertainty measures rather than uniform compression.
Result: Reduces KV cache size to 4.74% of original, achieves 6% prefill speedup, improves throughput by 6.4x, and reveals special long-range dependencies like retrieval heads and layers.
Conclusion: Uncertainty-based sparsity analysis provides effective compression optimization and new insights into LLM sparsity patterns, validating the theoretical approach while delivering strong lossless performance.
Abstract: Deploying large language models (LLMs) for long-context inference remains challenging due to their substantial memory and computational demands. While techniques such as Key-Value (KV) cache compression are designed to reduce memory usage, they often neglect the structured sparsity inherent in the relationship between hidden states and their corresponding KV cache. In this work, we explore the role of uncertainty as a potential indicator of sparsity within LLMs. We propose UNComp, an uncertainty-aware framework that leverages truncated matrix entropy to identify areas of low information content, thereby revealing sparsity patterns that can be used for adaptive compression. Unlike traditional methods that apply uniform compression, UNComp dynamically adjusts its approach to compression, guided by uncertainty measures that reflect the importance of various model components. Our analysis shows that sparsity patterns, when derived from uncertainty estimates, can be exploited to reveal special long-range dependencies, such as retrieval heads and retrieval layers. This perspective not only enhances our understanding of how compression can be optimized but also provides new insights into the inherent sparsity of LLMs during long-context inference. By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4x - not only delivering strong lossless compression performance, but also validating the effectiveness of the underlying theoretical tool. We release the code at https://github.com/menik1126/UNComp.
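The uncertainty signal UNComp builds on, a truncated matrix entropy over hidden states, can be sketched via the singular-value spectrum: low spectral entropy means the states concentrate in a few directions and are candidates for aggressive compression. The truncation rule and normalization below are plausible assumptions, not the paper's exact estimator.

```python
# Truncated spectral-entropy sketch (normalization/truncation are assumptions).
import numpy as np

def truncated_matrix_entropy(H, k=8):
    """H: (tokens, hidden) matrix of states; entropy of the top-k singular values."""
    s = np.linalg.svd(H, compute_uv=False)[:k]
    p = (s ** 2) / (s ** 2).sum()               # normalized spectral distribution
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
full_rank = rng.normal(size=(64, 32))                            # diverse states
low_rank = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 32))   # redundant states
print(truncated_matrix_entropy(full_rank))   # higher entropy -> retain more cache
print(truncated_matrix_entropy(low_rank))    # lower entropy  -> compress harder
```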
[89] Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets
Mahdi Zakizadeh, Mohammad Taher Pilehvar
Main category: cs.CL
TL;DR: Current gender stereotype benchmarks provide fragmented views of bias in language models. By balancing data across gender stereotype components using social psychology frameworks, simple techniques can improve correlation between different measurement approaches.
Details
Motivation: Current benchmarks underestimate the complexity of measuring gender stereotypical bias in language models and fail to capture the full extent of the problem, providing only partial and inconsistent views.
Method: Used StereoSet and CrowS-Pairs as case studies to investigate how data distribution affects benchmark results. Applied a framework from social psychology to balance benchmark data across various components of gender stereotypes.
Result: Simple balancing techniques significantly improved the correlation between different measurement approaches, demonstrating that current benchmarks capture only partial facets of gender stereotypes.
Conclusion: The findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
Abstract: Accurately measuring gender stereotypical bias in language models is a complex task with many hidden aspects. Current benchmarks have underestimated this multifaceted challenge and failed to capture the full extent of the problem. This paper examines the inconsistencies between intrinsic stereotype benchmarks. We propose that currently available benchmarks each capture only partial facets of gender stereotypes, and when considered in isolation, they provide just a fragmented view of the broader landscape of bias in language models. Using StereoSet and CrowS-Pairs as case studies, we investigated how data distribution affects benchmark results. By applying a framework from social psychology to balance the data of these benchmarks across various components of gender stereotypes, we demonstrated that even simple balancing techniques can significantly improve the correlation between different measurement approaches. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
[90] Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting
Dong-Hai Zhu, Yu-Jie Xiong, Jia-Chen Zhang, Xi-Jiong Xie, Chun-Ming Xia
Main category: cs.CL
TL;DR: ISP^2 is a pre-prompting method that improves Chain-of-Thought reasoning by iteratively extracting and merging key information pairs before generating answers, achieving 7.1% performance improvement.
Details
Motivation: Chain-of-Thought prompting struggles when key reasoning information is implicit or missing, as it focuses on reasoning steps without early extraction of essential information.
Method: Extract entities and descriptions to form key information pairs, rate their reliability, iteratively merge lowest-ranked pairs until obtaining a unique key pair, then feed this with the original question to LLMs.
Result: Extensive experiments show 7.1% improvement over existing methods.
Conclusion: ISP^2 offers an inductive pre-prompting approach that flexibly integrates into diverse reasoning frameworks, addressing CoT’s limitations with implicit information.
Abstract: Chain-of-Thought (CoT) Prompting is a dominant paradigm in Large Language Models (LLMs) to enhance complex reasoning. It guides LLMs to present multi-step reasoning, rather than generating the final answer directly. However, CoT encounters difficulties when key information required for reasoning is implicit or missing. This occurs because CoT emphasizes the sequence of reasoning steps while overlooking the early extraction of essential information. We propose a pre-prompting method called Iterative Summarization Pre-Prompting (ISP^2) to refine LLM reasoning when key information is not explicitly provided. First, entities and their corresponding descriptions are extracted to form potential key information pairs. Next, we use a reliability rating to assess these pairs, then merge the two lowest-ranked pairs into a new entity description. This process is repeated until a unique key information pair is obtained. Finally, that pair, along with the original question, is fed into LLMs to produce the answer. Extensive experiments demonstrate a 7.1% improvement compared to existing methods. Unlike traditional prompting, ISP^2 adopts an inductive approach with pre-prompting, offering flexible integration into diverse reasoning frameworks. The code is available at https://github.com/zdhgreat/ISP-2.
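The iterative merge loop of ISP^2 is simple to sketch end-to-end; the reliability rating and merge functions below are stubs standing in for the paper's LLM calls.

```python
# ISP^2-style loop: rate pairs, merge the two weakest, stop at a single pair.
def rate(pair):
    entity, description = pair
    return len(description)   # stub: a real system would ask an LLM for a score

def merge(pair_a, pair_b):
    (e1, d1), (e2, d2) = pair_a, pair_b
    return (f"{e1}+{e2}", f"{d1}; {d2}")   # stub for an LLM-written merged description

def isp2(pairs):
    while len(pairs) > 1:
        pairs = sorted(pairs, key=rate)      # lowest-reliability pairs first
        merged = merge(pairs[0], pairs[1])   # fuse the two weakest pairs
        pairs = [merged] + pairs[2:]
    return pairs[0]   # the unique key-information pair fed back with the question

pairs = [("train", "departs at 9am"), ("ticket", "costs $12"),
         ("platform", "number 4")]
print(isp2(pairs))
```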
[91] LLMs Reproduce Stereotypes of Sexual and Gender Minorities
Ruby Ostrow, Adam Lopez
Main category: cs.CL
TL;DR: This paper examines gender bias in NLP systems beyond binary categories, focusing on sexual and gender minorities using the Stereotype Content Model, and shows that LLMs amplify negative stereotypes in text generation.
Details
Motivation: Most existing research on gender bias in NLP systems takes a binary view of gender, ignoring the spectrum of gender and sexual identities. This study aims to address this gap by analyzing biases towards sexual and gender minorities.
Method: The study uses the Stereotype Content Model from social psychology to analyze biases. It first applies English-language survey questions to assess social perceptions from both humans and LLMs, then extends the framework to text generation tasks.
Result: Both humans and LLMs exhibit more negative stereotypes towards sexual and gender minorities. In text generation, LLMs produce stereotyped representations of these groups, amplifying representational harms in creative writing.
Conclusion: LLMs not only reflect but amplify existing societal biases against sexual and gender minorities, highlighting the need for more inclusive approaches in NLP systems to mitigate these harms.
Abstract: A large body of research has found substantial gender bias in NLP systems. Most of this research takes a binary, essentialist view of gender: limiting its variation to the categories men and women, conflating gender with sex, and ignoring different sexual identities. But gender and sexuality exist on a spectrum, so in this paper we study the biases of large language models (LLMs) towards sexual and gender minorities beyond binary categories. Grounding our study in a widely used social psychology model – the Stereotype Content Model – we demonstrate that English-language survey questions about social perceptions elicit more negative stereotypes of sexual and gender minorities from both humans and LLMs. We then extend this framework to a more realistic use case: text generation. Our analysis shows that LLMs generate stereotyped representations of sexual and gender minorities in this setting, showing that they amplify representational harms in creative writing, a widely advertised use for LLMs.
[92] BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues
Prashant Jayannavar, Liliang Ren, Marisa Hudspeth, Risham Sidhu, Charlotte Lambert, Ariel Cordes, Elizabeth Kaplan, Anjali Narayan-Chen, Julia Hockenmaier
Main category: cs.CL
TL;DR: The paper introduces BAP v2, an enhanced benchmark for the Builder Action Prediction subtask in Minecraft Collaborative Building, addressing evaluation issues, data scarcity, and modeling challenges. It presents a new SOTA model (Llama-CRAFTS) that achieves 53.0 F1 score and shows synthetic data training improves spatial reasoning capabilities.
Details
Motivation: To develop interactive agents that understand language, perceive surroundings, and act in physical worlds, using Minecraft Collaborative Building Task as a testbed for grounded instruction following with limited training data.
Method: Defined enhanced evaluation benchmark with cleaner test set and better metrics; generated synthetic MCBT data to address data scarcity; introduced Llama-CRAFTS model with richer input representations.
Result: Llama-CRAFTS achieved 53.0 F1 score on BAP v2, a 6-point improvement over previous work. Models trained on synthetic data showed improved performance across all tasks, revealing spatial reasoning as the primary bottleneck.
Conclusion: BAP v2 establishes a fertile ground for future research and provides a useful measure of current text-only LLMs’ spatial capabilities in embodied tasks, highlighting the remaining difficulty despite notable improvements.
Abstract: Developing interactive agents that can understand language, perceive their surroundings, and act within the physical world is a long-standing goal of AI research. The Minecraft Collaborative Building Task (MCBT) (Narayan-Chen, Jayannavar, and Hockenmaier 2019), a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated 3D Blocks World environment, offers a rich platform to work towards this goal. In this work, we focus on the Builder Action Prediction (BAP) subtask: predicting B’s actions in a multimodal game context (Jayannavar, Narayan-Chen, and Hockenmaier 2020) - a challenging testbed for grounded instruction following, with limited training data. We holistically re-examine this task and introduce BAP v2 to address key challenges in evaluation, training data, and modeling. Specifically, we define an enhanced evaluation benchmark, featuring a cleaner test set and fairer, more insightful metrics that also reveal spatial reasoning as the primary performance bottleneck. To address data scarcity and to teach models basic spatial skills, we generate different types of synthetic MCBT data. We observe that current, LLM-based SOTA models trained on the human BAP dialogues fail on these simpler, synthetic BAP tasks, but show that training models on this synthetic data improves their performance across the board. We also introduce a new SOTA model, Llama-CRAFTS, which leverages richer input representations, and achieves an F1 score of 53.0 on the BAP v2 task and strong performance on the synthetic data. While this result marks a notable 6-point improvement over previous work, it also underscores the task’s remaining difficulty, establishing BAP v2 as a fertile ground for future research, and providing a useful measure of the spatial capabilities of current text-only LLMs in such embodied tasks.
[93] LLMs as a synthesis between symbolic and distributed approaches to language
Gemma Boleda
Main category: cs.CL
TL;DR: This position paper argues that deep learning models for language represent a synthesis between symbolic and distributed approaches to cognition, showing how LLMs flexibly use both discrete and continuous representations.
Details
Motivation: To bridge the long-standing divide between symbolic and distributed approaches to language and cognition, and demonstrate that deep learning models successfully integrate both traditions.
Method: Review of recent interpretability research showing how morphosyntactic knowledge is encoded in near-discrete fashion in LLMs, and analysis of how models flexibly alternate between distributed and symbolic processing modes.
Result: Evidence shows that LLMs encode substantial morphosyntactic knowledge in near-discrete representations and can flexibly switch between different processing modes as needed.
Conclusion: Deep learning models represent a successful synthesis of symbolic and distributed approaches, which explains their success and makes them particularly valuable for language study, suggesting it’s time for peace between the two traditions.
Abstract: Since the middle of the 20th century, a fierce battle has been fought between symbolic and distributed approaches to language and cognition. The success of deep learning models, and LLMs in particular, has been alternatively taken as showing that the distributed camp has won, or dismissed as an irrelevant engineering development. In this position paper, I argue that deep learning models for language actually represent a synthesis between the two traditions. This is because 1) deep learning architectures allow for both distributed/continuous/fuzzy and symbolic/discrete/categorical-like representations and processing; 2) models trained on language make use of this flexibility. In particular, I review recent research in interpretability that showcases how a substantial part of morphosyntactic knowledge is encoded in a near-discrete fashion in LLMs. This line of research suggests that different behaviors arise in an emergent fashion, and models flexibly alternate between the two modes (and everything in between) as needed. This is possibly one of the main reasons for their wild success; and it makes them particularly interesting for the study of language. Is it time for peace?
[94] Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
Filippo Momentè, Alessandro Suglia, Mario Giulianelli, Ambra Ferrari, Alexander Koller, Oliver Lemon, David Schlangen, Raquel Fernández, Raffaella Bernardi
Main category: cs.CL
TL;DR: Interactive games are more effective than standard benchmarks at discriminating LLM quality, with cognitive tests revealing correlations between reasoning abilities and model performance.
Details
Motivation: To determine which evaluation paradigm (benchmarks vs. interactive games) best discriminates LLM quality, and to investigate how cognitive abilities correlate with model performance.
Method: Examined three evaluation paradigms: standard benchmarks (MMLU, BBH), interactive games (Signalling Games, Taboo), and cognitive tests (working memory, theory of mind). Compiled a suite of targeted cognitive tests and analyzed correlations with model performance.
Result: Interactive games are superior to standard benchmarks in discriminating models. Causal/logical reasoning correlates with both test types, while executive functions and social/emotional skills correlate more with games.
Conclusion: Advocates for developing new interactive benchmarks and targeted cognitive tasks specifically designed for LLMs, inspired by human ability assessments.
Abstract: We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two – benchmarks or games – is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.
[95] Bridging Information Gaps with Comprehensive Answers: Improving the Diversity and Informativeness of Follow-Up Questions
Zhe Liu, Taekyu Kang, Haoyu Wang, Seyed Hossein Alavi, Vered Shwartz
Main category: cs.CL
TL;DR: A knowledge distillation pipeline that uses teacher LLMs to generate comprehensive answers, identify information gaps compared to initial answers, and create gap-bridging follow-up questions to augment training data for smaller student models.
Details
Motivation: To help small, locally hosted conversational agents generate more diverse and informative follow-up questions that uncover missing information, which remains challenging for resource-constrained systems.
Method: Information-gap-driven knowledge distillation pipeline where a teacher LLM generates comprehensive answers, contrasts them with initial answers to identify gaps, formulates follow-up questions, and uses this to augment the FollowupQG dataset tenfold for fine-tuning smaller student models.
Result: Fine-tuned student models achieve significantly higher informativeness and diversity than variations trained on the original dataset, showing effective knowledge transfer from teacher to student models.
Conclusion: The pipeline provides an efficient distillation channel from state-of-the-art LLMs to smaller models, enabling resource-constrained conversational systems to generate more diverse and informative follow-up questions by mirroring human cognitive processes of information seeking.
Abstract: Generating diverse follow-up questions that uncover missing information remains challenging for conversational agents, particularly when they run on small, locally hosted models. To address this, we develop an information-gap-driven knowledge distillation pipeline in which a teacher LLM generates a comprehensive answer, contrasts it with the initial answer to identify information gaps, and formulates gap-bridging follow-up questions. Using this pipeline, we augment the existing FollowupQG dataset tenfold. We then fine-tune smaller student models on the augmented dataset to distill the teacher’s knowledge. Experiments with selected teacher-student model pairs show that fine-tuned students achieve significantly higher informativeness and diversity than variations trained on the original dataset. These findings indicate that our pipeline, which mirrors the human cognitive process of information seeking, provides an efficient distillation channel from state-of-the-art LLMs to smaller models, enabling resource-constrained conversational systems to generate more diverse and informative follow-up questions.
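The distillation loop is straightforward to picture. A minimal sketch assuming a generic `teacher(prompt) -> str` callable; the prompt wording is ours, not the paper's:

```python
# Sketch of the information-gap pipeline: comprehensive answer -> gap list ->
# gap-bridging follow-up questions. `teacher` is any text-in/text-out callable.

def gap_followups(teacher, question: str, initial_answer: str) -> list[str]:
    comprehensive = teacher(f"Answer as completely as possible:\n{question}")
    gaps = teacher(
        "List facts present in the comprehensive answer but missing from the "
        f"initial one, one per line.\nInitial: {initial_answer}\n"
        f"Comprehensive: {comprehensive}")
    questions = teacher(
        "Write one natural follow-up question per information gap below, "
        f"one per line.\nGaps:\n{gaps}")
    return [q.strip() for q in questions.splitlines() if q.strip()]
```

The resulting (question, follow-up) pairs then become fine-tuning data for the smaller student model.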
[96] What are Foundation Models Cooking in the Post-Soviet World?
Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
Main category: cs.CL
TL;DR: This paper introduces BORSch, a multimodal dataset for evaluating cultural food knowledge in foundation models, showing they struggle with Post-Soviet dish identification due to language biases and misleading data patterns.
Details
Motivation: To investigate how foundation models handle cultural knowledge from Post-Soviet states, particularly food knowledge, and to address biases where models incorrectly attribute dish origins based on language rather than cultural context.
Method: Created BORSch dataset with 1147 Russian and 823 Ukrainian dishes, tested models on text-only and multimodal QA for dish origin identification, analyzed pretraining data for co-occurrence patterns, and evaluated visual description generation capabilities.
Result: Leading models consistently misattribute Post-Soviet dish origins to countries associated with the query language, show weak correlation between QA performance and visual description accuracy, and reveal linguistic biases like Russian-Ukrainian code mixing in training data.
Conclusion: QA alone is insufficient for evaluating cultural understanding; multimodal approaches are needed. The BORSch dataset enables better assessment of cultural knowledge in AI systems.
Abstract: The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multimodal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pretraining data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models’ abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding. To foster further research, we will make BORSch publicly available at https://github.com/alavrouk/BORSch.
[97] MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
Teng Lin, Yuyu Luo, Honglin Zhang, Jicheng Zhang, Chunlin Liu, Kaishun Wu, Nan Tang
Main category: cs.CL
TL;DR: MEBench is a new benchmark for evaluating LLMs and RAG systems on multi-entity question answering, revealing that even advanced models achieve only 59% accuracy on complex entity-dense questions requiring cross-document integration.
Details
Motivation: Current LLMs and RAG systems struggle with multi-entity QA tasks that require consolidating scattered information across diverse documents, particularly for entity-dense questions requiring cross-document aggregation.
Method: The authors introduce MEBench, a multi-document, multi-entity benchmark comprising 4,780 questions categorized into three primary categories and eight distinct types, using Entity-Attributed F1 (EA-F1) metric for granular evaluation.
Result: Experiments on state-of-the-art LLMs (GPT-4, Llama-3) and RAG pipelines show critical limitations, with only 59% accuracy achieved on MEBench.
Conclusion: MEBench highlights systemic weaknesses in current LLM frameworks and provides a foundation for advancing robust, entity-aware QA architectures that emphasize completeness and factual precision of information extraction.
Abstract: Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often struggle with cross-document aggregation, particularly when resolving entity-dense questions like “What is the distribution of ACM Fellows among various fields of study?”, which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs’ capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions which are systematically categorized into three primary categories, further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
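For intuition, an entity-attributed F1 in the spirit of the paper's EA-F1 can be sketched as set overlap over (entity, attribution) pairs, where a prediction counts only if both parts match gold; MEBench's exact matching rules may differ:

```python
# Sketch of an Entity-Attributed F1: a predicted entity counts only if its
# attached attribution also matches gold. Illustrative, not the benchmark code.

def ea_f1(predicted: set[tuple[str, str]], gold: set[tuple[str, str]]) -> float:
    """Each item is an (entity, attribution) pair; both must match to count."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one correct pair, one entity with a wrong attribution.
pred = {("Ada Lovelace", "mathematics"), ("Alan Turing", "physics")}
gold = {("Ada Lovelace", "mathematics"), ("Alan Turing", "computer science")}
print(round(ea_f1(pred, gold), 2))  # 0.5
```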
[98] HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen
Main category: cs.CL
TL;DR: Highlighted Chain-of-Thought Prompting (HoT) is a technique that uses XML tags to ground LLM responses to input facts, improving accuracy on various tasks but potentially misleading users when the LLM is wrong.
Details
Motivation: To address LLM hallucination by making factual statements more verifiable for humans through explicit grounding of responses to input facts.
Method: HoT prompts LLMs to first reformat questions with XML tags highlighting key facts, then generate responses with highlights referencing those input facts.
Result: HoT outperforms vanilla chain of thought prompting on 17 tasks and helps time-limited users verify correct responses more efficiently, but may increase false confidence in incorrect responses.
Conclusion: While HoT improves factual grounding and verification efficiency, it introduces a risk of over-trust in incorrect LLM outputs due to the highlighting mechanism.
Abstract: An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then, generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks from arithmetic, reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, the highlights tend to make users believe that an answer is correct.
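The two-stage prompting is easy to picture in code. A minimal sketch with a generic `llm(prompt) -> str` callable; the tag scheme and instructions are illustrative, not the paper's exact templates:

```python
# Two-stage HoT-style prompting: tag key facts in the question, then ask for
# an answer that reuses the same tags, so each claim points back to an input fact.
import re

def hot_answer(llm, question: str) -> tuple[str, str]:
    tagged_q = llm(
        "Re-write the question, wrapping each key fact in <fact1>...</fact1>, "
        "<fact2>...</fact2>, etc. Change nothing else.\n" + question)
    answer = llm(
        "Answer step by step. Whenever you use a fact from the question, "
        "wrap it in the same <factN> tag used there.\n" + tagged_q)
    return tagged_q, answer

def cited_facts(answer: str) -> set[str]:
    """Which input facts does the answer actually reference?"""
    return set(re.findall(r"<(fact\d+)>", answer))
```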
[99] Large Language Models for Multilingual Previously Fact-Checked Claim Detection
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Tatiana Anikina, Michal Gregor, Marián Šimko
Main category: cs.CL
TL;DR: This paper presents the first comprehensive evaluation of LLMs for multilingual previously fact-checked claim detection across 20 languages, showing strong performance for high-resource languages but limitations with low-resource languages.
Details
Motivation: To address the challenge of duplicated fact-checking efforts across languages and countries as false information transcends linguistic boundaries, enabling automatic detection of previously fact-checked claims across different languages.
Method: Evaluated seven large language models (LLMs) across 20 languages in both monolingual and cross-lingual settings, including testing translation approaches where original texts were translated into English.
Result: LLMs perform well for high-resource languages but struggle with low-resource languages. Translating original texts into English proved beneficial for low-resource languages.
Conclusion: LLMs show potential for multilingual previously fact-checked claim detection, providing a foundation for further research, though performance varies significantly across language resource levels.
Abstract: In our era of widespread false information, human fact-checkers often face the challenge of duplicating efforts when verifying claims that may have already been addressed in other countries or languages. As false information transcends linguistic boundaries, the ability to automatically detect previously fact-checked claims across languages has become an increasingly important task. This paper presents the first comprehensive evaluation of large language models (LLMs) for multilingual previously fact-checked claim detection. We assess seven LLMs across 20 languages in both monolingual and cross-lingual settings. Our results show that while LLMs perform well for high-resource languages, they struggle with low-resource languages. Moreover, translating original texts into English proved to be beneficial for low-resource languages. These findings highlight the potential of LLMs for multilingual previously fact-checked claim detection and provide a foundation for further research on this promising application of LLMs.
[100] Language Models Fail to Introspect About Their Knowledge of Language
Siyuan Song, Jennifer Hu, Kyle Mahowald
Main category: cs.CL
TL;DR: LLMs cannot introspect about their internal states despite high task accuracy from metalinguistic prompting, showing no privileged self-access beyond what similar models can predict.
Details
Motivation: To investigate whether LLMs can introspect about their internal linguistic knowledge, which would enhance interpretability and validate linguistic evaluation methods.
Method: Systematically evaluated 21 open-source LLMs across grammatical knowledge and word prediction domains, comparing prompted responses with string probabilities and controlling for model similarity.
Result: No evidence found for LLM introspection; prompted responses do not reflect privileged internal knowledge beyond what similar models can predict.
Conclusion: LLMs cannot introspect, and prompted responses should not be conflated with models’ actual linguistic generalizations.
Abstract: There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking “Is this sentence grammatical?”). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model’s internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models’ responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model’s prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged “self-access”. By using general tasks, controlling for model similarity, and evaluating a wide range of open-source models, we show that LLMs cannot introspect, and add new evidence to the argument that prompted responses should not be conflated with models’ linguistic generalizations.
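The proposed introspection measure can be sketched as a correlation gap: do a model's prompted judgments track its own log-probabilities better than those of a near-identical twin model? A toy numpy version; the paper's actual measure controls for model similarity more carefully:

```python
# Toy version of the introspection test: positive advantage would suggest
# privileged self-access; the paper finds essentially none. Inputs here are
# synthetic stand-ins for metalinguistic judgments and sentence log-probs.
import numpy as np

def introspection_advantage(prompted_scores, own_logprobs, twin_logprobs):
    corr_self = np.corrcoef(prompted_scores, own_logprobs)[0, 1]
    corr_twin = np.corrcoef(prompted_scores, twin_logprobs)[0, 1]
    return corr_self - corr_twin

rng = np.random.default_rng(0)
logp_a = rng.normal(size=200)                        # model A's log-probs
logp_b = logp_a + rng.normal(scale=0.1, size=200)    # near-identical twin model
judgments = logp_a + rng.normal(scale=1.0, size=200) # A's prompted judgments
print(introspection_advantage(judgments, logp_a, logp_b))  # close to zero
```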
[101] Modeling Subjectivity in Cognitive Appraisal with Language Models
Yuxiang Zhou, Hainiu Xu, Desmond C. Ong, Maria Liakata, Petr Slovak, Yulan He
Main category: cs.CL
TL;DR: This paper explores how language models can quantify subjectivity in cognitive appraisal, examining factors like personality traits and demographic information that influence subjective measurements.
Details
Motivation: As language models are increasingly used in human-centered studies, there’s a need to understand their capability to model subjectivity - a crucial factor in cognitive science that remains under-explored at the intersection with NLP.
Method: Conducted comprehensive experiments and analyses with both fine-tuned models and prompt-based large language models (LLMs) to quantify subjectivity in cognitive appraisal.
Result: Personality traits and demographic information are critical for measuring subjectivity, but existing post-hoc calibration methods often fail to achieve satisfactory performance.
Conclusion: The study provides valuable insights to guide future research at the intersection of NLP and cognitive science, highlighting the importance of better modeling subjectivity in language models.
Abstract: As the utilization of language models in interdisciplinary, human-centered studies grows, expectations of their capabilities continue to evolve. Beyond excelling at conventional tasks, models are now expected to perform well on user-centric measurements involving confidence and human (dis)agreement – factors that reflect subjective preferences. While modeling subjectivity plays an essential role in cognitive science and has been extensively studied, its investigation at the intersection with NLP remains under-explored. In light of this gap, we explore how language models can quantify subjectivity in cognitive appraisal by conducting comprehensive experiments and analyses with both fine-tuned models and prompt-based large language models (LLMs). Our quantitative and qualitative results demonstrate that personality traits and demographic information are critical for measuring subjectivity, yet existing post-hoc calibration methods often fail to achieve satisfactory performance. Furthermore, our in-depth analysis provides valuable insights to guide future research at the intersection of NLP and cognitive science.
[102] Aligned Probing: Relating Toxic Behavior and Model Internals
Andreas Waldis, Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Iryna Gurevych
Main category: cs.CL
TL;DR: Aligned probing is a new interpretability framework that aligns language model outputs with internal representations to study toxicity across 20+ models, revealing toxicity encoding patterns and practical applications.
Details
Motivation: To bridge behavioral and internal perspectives of language models for toxicity analysis, providing both correlative and causal evidence about how models generate toxic content.
Method: Aligned probing framework that examines outputs and internal representations of over 20 OLMo, Llama, and Mistral models, with case studies on detoxification, multi-prompt evaluations, quantization, and pre-training dynamics.
Result: LMs strongly encode toxicity information in lower layers; models generate less toxic output when strongly encoding input toxicity; heterogeneity exists across toxicity attributes like Threat; practical insights from case studies.
Conclusion: The framework enables holistic understanding of LMs, contributing insights both within and beyond toxicity contexts, with demonstrated practical impact through various applications.
Abstract: We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), based on their outputs, and their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Focusing on how unique LMs differ offers both correlative and causal evidence that they generate less toxic output when strongly encoding information about the input toxicity. We also highlight the heterogeneity of toxicity, as model behavior and internals vary across unique attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluations, model quantization, and pre-training dynamics underline the practical impact of aligned probing with further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.
[103] Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models
Siwei Zhang, Yun Xiong, Yateng Tang, Jiarong Xu, Xi Chen, Zehao Gu, Xuezheng Hao, Zian Jia, Jiawei Zhang
Main category: cs.CL
TL;DR: CROSS is a framework that extends temporal graph neural networks to handle temporal text-attributed graphs by using LLMs to dynamically extract temporal semantics and unify them with structural information for improved performance.
Details
Motivation: Existing TGNNs treat text statically and prioritize structural information, ignoring the temporal evolution of text semantics and the interplay between semantics and structures in temporal text-attributed graphs.
Method: CROSS decomposes TTAG modeling into two phases: (1) temporal semantics extraction using LLMs to understand evolving textual contexts, and (2) semantic-structural unification via a co-encoder that synthesizes representations considering both information types.
Result: CROSS achieves state-of-the-art results with 24.7% absolute MRR gain in temporal link prediction and 3.7% AUC gain in node classification across four public and one industrial dataset.
Conclusion: The framework effectively addresses the limitations of existing TGNNs by dynamically capturing temporal text semantics and unifying them with structural information, demonstrating significant performance improvements in TTAG modeling.
Abstract: Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such a combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that disproportionately prioritize structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures for synergistic reinforcement. To tackle these issues, we present CROSS, a flexible framework that seamlessly extends existing TGNNs for TTAG modeling. CROSS is designed by decomposing the TTAG modeling process into two phases: (i) temporal semantics extraction; and (ii) semantic-structural information unification. The key idea is to leverage large language models (LLMs) to dynamically extract the temporal semantics in text space and then generate cohesive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the CROSS framework, which empowers LLMs to offer a temporal semantic understanding of a node’s evolving textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor for synthesizing illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experiments show that CROSS achieves state-of-the-art results on four public datasets and one industrial dataset, with 24.7% absolute MRR gain on average in temporal link prediction and 3.7% AUC gain in node classification on the industrial application.
[104] Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, Xiaojun Jia
Main category: cs.CL
TL;DR: DR-IRL proposes a dynamic reward scaling method using inverse reinforcement learning to address imbalanced safety datasets and static reward models in LLM alignment.
Details
Motivation: Existing alignment techniques face challenges with imbalanced safety datasets (overrepresenting common hazards while neglecting long-tail threats) and static reward models that ignore task difficulty, limiting optimization efficiency.
Method: Train category-specific reward models using balanced safety dataset covering 7 harmful categories via IRL, then enhance GRPO with dynamic reward scaling that adjusts rewards by task difficulty using text encoder cosine similarity (data-level) and reward gaps (model-level).
Result: Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
Conclusion: DR-IRL effectively addresses key limitations in current LLM alignment approaches by dynamically adjusting rewards based on task difficulty, achieving superior safety performance without compromising model usefulness.
Abstract: Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (train a reward model on preference pairs and optimize with reinforcement learning) or reward-free (directly fine-tune on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain robust, and single-response demonstrations can outperform pairwise preference data. However, two challenges persist: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. We propose DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning). We first train category-specific reward models using a balanced safety dataset covering seven harmful categories via IRL. Then we enhance Group Relative Policy Optimization (GRPO) by introducing dynamic reward scaling, which adjusts rewards according to task difficulty: data-level hardness is measured via text-encoder cosine similarity, and model-level responsiveness via reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
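One way to picture the dynamic scaling, as a rough sketch only: combine a data-level hardness term (embedding distance from typical prompts) with a model-level term (reward gap). The combination rule and constants below are our assumptions, not the paper's formulas:

```python
# Illustrative difficulty-scaled reward in the spirit of DR-IRL. The actual
# scaling functions are defined in the paper; this only shows the two signals.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def scaled_reward(base_reward: float,
                  prompt_emb: np.ndarray,
                  easy_centroid_emb: np.ndarray,
                  reward_gap: float) -> float:
    """Upweight hard prompts: far from 'easy' data, small reward gap."""
    data_hardness = 1.0 - cosine(prompt_emb, easy_centroid_emb)  # data level
    model_hardness = 1.0 / (1.0 + abs(reward_gap))               # model level
    return base_reward * (1.0 + data_hardness) * (1.0 + model_hardness)
```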
[105] Playpen: An Environment for Exploring Learning Through Conversational Interaction
Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, Raffaella Bernardi, Raquel Fernández, Alexander Koller, Oliver Lemon, David Schlangen, Mario Giulianelli, Alessandro Suglia
Main category: cs.CL
TL;DR: The paper investigates using Dialogue Games as feedback signals for post-training LLMs, introduces Playpen environment for self-play learning, and finds that interactive learning with GRPO shows balanced improvements without skill loss.
Details
Motivation: To explore whether Dialogue Games can serve as effective feedback signals for learning in LLMs, moving beyond traditional reward models for post-training.
Method: Introduces Playpen environment for Dialogue Game self-play, tests three post-training methods: supervised fine-tuning (SFT), direct alignment (DPO), and reinforcement learning with GRPO on Llama-3.1-8B-Instruct.
Result: SFT improves performance on unseen game instances but negatively impacts other skills, while GRPO shows balanced improvements without skill loss.
Conclusion: Interactive learning through Dialogue Games is a promising direction for LLM post-training, with GRPO demonstrating effective balanced learning.
Abstract: Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model’s response. In this paper, we investigate whether Dialogue Games – goal-directed and rule-governed activities driven predominantly by verbal actions – can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning; direct alignment (DPO); and reinforcement learning with GRPO. We experiment with post-training a small LLM (Llama-3.1-8B-Instruct), evaluating performance on unseen instances of training games as well as unseen games, and on standard benchmarks. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills, while interactive learning with GRPO shows balanced improvements without loss of skills. We release the framework and the baseline training setups to foster research in the promising new direction of learning in (synthetic) interaction.
[106] Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare
Lovedeep Gondara, Jonathan Simkin, Graham Sayle, Shebnum Devji, Gregory Arbour, Raymond Ng
Main category: cs.CL
TL;DR: This study compares language model selection strategies for medical text classification, finding that fine-tuned Small Language Models (SLMs) consistently outperform zero-shot Large Language Models (LLMs) on specialized tasks, with domain-specific pretraining providing additional benefits.
Details
Motivation: To guide language model selection by investigating the necessity of finetuning vs. zero-shot usage, benefits of domain-adjacent vs. generic models, value of domain-specific pretraining, and continued relevance of SLMs compared to LLMs for specific tasks.
Method: Used electronic pathology reports from BC Cancer Registry to evaluate three classification scenarios with varying difficulty and data size. Compared various SLMs (both zero-shot and finetuned) against a zero-shot LLM, assessing domain-adjacent vs. generic models and domain-specific pretraining effects.
Result: Finetuning significantly improved SLM performance across all scenarios. Zero-shot LLM outperformed zero-shot SLMs but was consistently beaten by finetuned SLMs. Domain-adjacent SLMs performed better than generic SLM after finetuning, especially on harder tasks. Domain-specific pretraining provided modest gains on easy tasks but significant improvements on complex, data-scarce tasks.
Conclusion: SLMs remain highly relevant in the LLM era, offering superior performance-resource trade-off when appropriately finetuned for specialized domains. Finetuning is critical for SLMs to surpass zero-shot LLM performance, with domain-specific pretraining providing additional advantages for complex problems or limited data.
Abstract: This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology reports from the British Columbia Cancer Registry (BCCR), three classification scenarios with varying difficulty and data size are evaluated. Models include various SLMs and an LLM. SLMs are evaluated both zero-shot and finetuned; the LLM is evaluated zero-shot only. Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results. The zero-shot LLM outperformed zero-shot SLMs but was consistently outperformed by finetuned SLMs. Domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks. Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task. The results highlight the critical role of finetuning for SLMs in specialized domains, enabling them to surpass zero-shot LLM performance on targeted classification tasks. Pretraining on domain-adjacent or domain-specific data provides further advantages, particularly for complex problems or limited finetuning data. While LLMs offer strong zero-shot capabilities, their performance on these specific tasks did not match that of appropriately finetuned SLMs. In the era of LLMs, SLMs remain relevant and effective, offering a potentially superior performance-resource trade-off compared to LLMs.
[107] Meeseeks: A Feedback-Driven, Iterative Self-Correction Benchmark evaluating LLMs’ Instruction Following Capability
Jiaming wang, Yunke Zhao, Peng Ding, Jun Kuang, Yibin Shen, Zhe Tang, Yilin Jin, ZongYu Wang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
Main category: cs.CL
TL;DR: Meeseeks is an automated iterative instruction-following benchmark with integrated feedback that helps LLMs self-correct by identifying errors and providing guidance, revealing significant performance disparities among state-of-the-art models.
Details
Motivation: LLMs struggle to precisely follow complex instructions in single responses, requiring better methods to improve instruction adherence for real-world agent applications.
Method: Developed Meeseeks benchmark with over 700 curated instances annotated by 32 capability tags in Chinese/English, using iterative feedback mechanism inspired by Chain-of-Thought and self-correction approaches.
Result: Different commercial and open-source LLMs show vastly disparate performance; even after 20 iterations of feedback-driven self-correction, most models remain suboptimal. Analysis revealed common issues and counterintuitive phenomena in current models.
Conclusion: Current LLMs have significant limitations in iterative instruction-following despite feedback mechanisms, highlighting the need for improved self-correction capabilities for reliable real-world agent applications.
Abstract: The capability to precisely adhere to instructions is a cornerstone for Large Language Models (LLMs) to function as dependable agents in real-world scenarios. However, confronted with complex prompts, LLMs frequently encounter difficulties in fulfilling all specified requirements within a single response. Drawing inspiration from recent advancements in Chain-of-Thought (CoT) prompting and self-correction methodologies, we introduce Meeseeks (The name is inspired by Mr. Meeseeks from “Rick and Morty,” a character renowned for efficiently accomplishing assigned tasks. See: https://en.wikipedia.org/wiki/Mr._Meeseeks), a fully automated iterative instruction-following benchmark equipped with an integrated feedback mechanism. Meeseeks identifies erroneous components in model responses and provides corresponding feedback accurately, thereby iteratively guiding the model toward self-correction. The dataset contains over 700 curated instances annotated by 32 distinct capability tags in Chinese and English. Extensive experimental results reveal that different state-of-the-art commercial and open-source LLMs exhibit vastly disparate performance, and even after 20 turns of iterative feedback-driven self-correction, nearly all models demonstrate suboptimal performance. We conducted comprehensive analysis from both macro and instance levels, uncovering numerous common issues prevalent in current state-of-the-art models, as well as several counterintuitive phenomena. We’ve open-sourced our work on https://github.com/ADoublLEN/Meeseeks.
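The feedback loop itself is simple. A sketch assuming a `model(prompt) -> str` callable and a rule-based `check(response)` that returns the list of violated requirements; names are illustrative, not the benchmark's API:

```python
# Sketch of an iterative feedback-driven self-correction loop in the style
# of Meeseeks: check the response, feed back the violations, and retry.

def feedback_loop(model, prompt: str, check, max_turns: int = 20):
    response = model(prompt)
    for turn in range(1, max_turns + 1):
        failures = check(response)           # which requirements were violated?
        if not failures:
            return response, turn            # all requirements satisfied
        feedback = ("Your answer violated these requirements:\n- "
                    + "\n- ".join(failures)
                    + "\nPlease revise your answer.")
        response = model(prompt + "\n" + response + "\n" + feedback)
    return response, max_turns               # still imperfect after the budget
```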
[108] Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging
Hongjin Qian, Zheng Liu
Main category: cs.CL
TL;DR: InForage is a reinforcement learning framework that enables LLMs to dynamically interact with external retrieval tools during inference, treating retrieval-augmented reasoning as an adaptive information-seeking process inspired by Information Foraging Theory.
Details
Motivation: Traditional retrieval-augmented generation methods use static pre-inference retrieval, which is inadequate for complex tasks with ambiguous, multi-step, or evolving information needs. Recent advances in test-time scaling techniques motivate the shift toward adaptive inference-time retrieval.
Method: InForage formalizes retrieval-augmented reasoning as a dynamic information-seeking process using reinforcement learning. It explicitly rewards intermediate retrieval quality and encourages iterative information gathering through adaptive search behaviors. A human-guided dataset was constructed to capture iterative search and reasoning trajectories.
Result: Extensive evaluations across general question answering, multi-hop reasoning tasks, and a new real-time web QA dataset demonstrate InForage’s superior performance over baseline methods.
Conclusion: InForage effectively builds robust, adaptive, and efficient reasoning agents by enabling LLMs to dynamically interact with external retrieval tools during inference.
Abstract: Augmenting large language models (LLMs) with external retrieval has become a standard method to address their inherent knowledge cutoff limitations. However, traditional retrieval-augmented generation methods employ static, pre-inference retrieval strategies, making them inadequate for complex tasks involving ambiguous, multi-step, or evolving information needs. Recent advances in test-time scaling techniques have demonstrated significant potential in enabling LLMs to dynamically interact with external tools, motivating the shift toward adaptive inference-time retrieval. Inspired by Information Foraging Theory (IFT), we propose InForage, a reinforcement learning framework that formalizes retrieval-augmented reasoning as a dynamic information-seeking process. Unlike existing approaches, InForage explicitly rewards intermediate retrieval quality, encouraging LLMs to iteratively gather and integrate information through adaptive search behaviors. To facilitate training, we construct a human-guided dataset capturing iterative search and reasoning trajectories for complex, real-world web tasks. Extensive evaluations across general question answering, multi-hop reasoning tasks, and a newly developed real-time web QA dataset demonstrate InForage’s superior performance over baseline methods. These results highlight InForage’s effectiveness in building robust, adaptive, and efficient reasoning agents.
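The key departure from outcome-only RL is that intermediate retrieval steps earn credit. A toy reward in that spirit, with an assumed weighting `alpha` and a generic per-step gain measure (both our choices, not the paper's):

```python
# Illustrative trajectory reward that credits intermediate retrieval quality
# alongside final answer correctness, in the spirit of InForage.

def trajectory_reward(retrieval_gains: list[float],
                      answer_correct: bool,
                      alpha: float = 0.3) -> float:
    """retrieval_gains[i]: quality of the i-th search step (e.g., overlap of
    retrieved evidence with gold support), each in [0, 1]."""
    outcome = 1.0 if answer_correct else 0.0
    foraging = sum(retrieval_gains) / max(len(retrieval_gains), 1)
    return (1 - alpha) * outcome + alpha * foraging

print(trajectory_reward([0.2, 0.6, 0.9], answer_correct=True))  # 0.87
```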
[109] SAFE: Improving LLM Systems using Sentence-Level In-generation Attribution
João Eduardo Batista, Emil Vatai, Mohamed Wahib
Main category: cs.CL
TL;DR: SAFE is a sentence-level attribution framework for RAG systems that improves LLM trustworthiness by accurately attributing generated sentences to their source documents during generation.
Details
Motivation: Current LLMs lack reliable source attribution, making them unreliable for scientific and high-stakes applications where traceability and accountability are crucial.
Method: A two-step framework: first predicts the required number of references for each sentence, then attributes the sentence to its source documents during generation.
Result: Achieved 95% accuracy in reference prediction and 2.1-6.0% improvements in attribution accuracy compared to top-1 methods; successfully generalized to real-world documents with hundreds to thousands of sentences.
Conclusion: SAFE provides verifiable sentence-level attribution that increases LLM safety and reliability, with the framework and dataset publicly available for broader adoption.
Abstract: Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability and accountability are paramount. To be reliable, attribution systems require high accuracy for short-length attribution on retrieved data, i.e., attribution to a sentence within a document rather than the entire document. We propose SAFE, a Sentence-level Attribution FramEwork for Retrieval-Augmented Generation (RAG) systems that attributes generated sentences during generation. This allows users to verify sentences as they read them and correct the model when the attribution indicates the generated text is not grounded in the documents, increasing the safety of LLM systems. This framework consists of two steps: predicting the required number of references for a sentence, and attributing the sentence. Our approach achieved 95% accuracy in the first step, which translated to 2.1-6.0% improvements in the accuracy (normalized for maximum possible accuracy) of all attribution algorithms in our clean dataset, when compared to their top-1 accuracy. We also applied SAFE in real-world scenarios with documents containing hundreds to thousands of sentences. In these settings, SAFE reliably attributed sentences to their source documents, demonstrating that the method generalizes beyond controlled benchmarks. The SAFE framework and the training dataset are publicly available on GitHub.
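The two steps compose naturally. A minimal sketch using cosine similarity over precomputed sentence embeddings; `predict_num_refs` stands in for the paper's trained first-step predictor:

```python
# Sketch of a SAFE-style two-step attribution: predict how many references a
# generated sentence needs, then pick that many source sentences by similarity.
import numpy as np

def attribute_sentence(sent_emb: np.ndarray,
                       doc_sent_embs: list[np.ndarray],
                       num_refs: int) -> list[int]:
    """Step 2: pick the top-k source sentences for one generated sentence."""
    sims = [float(sent_emb @ d / (np.linalg.norm(sent_emb) * np.linalg.norm(d)))
            for d in doc_sent_embs]
    return sorted(np.argsort(sims)[::-1][:num_refs].tolist())

def safe_attribute(generated_sents, sent_embs, doc_sent_embs, predict_num_refs):
    """Step 1 predicts the reference count per sentence; step 2 attributes."""
    out = []
    for sent, emb in zip(generated_sents, sent_embs):
        k = predict_num_refs(sent)  # e.g., 0 for commentary, 1-2 for claims
        out.append((sent, attribute_sentence(emb, doc_sent_embs, k) if k else []))
    return out
```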
[110] From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun
Main category: cs.CL
TL;DR: The paper introduces TED2025, a large-scale multi-way parallel corpus spanning 113 languages, and demonstrates that training LLMs on this aligned data outperforms using unaligned multilingual data across six benchmarks.
Details
Motivation: Unaligned multilingual data has limited ability to capture cross-lingual semantics, while multi-way parallel data provides stronger cross-lingual consistency and greater potential for improving multilingual performance.
Method: Created TED2025 corpus based on TED Talks (113 languages, up to 50 languages aligned in parallel), then investigated best practices for leveraging this data through continued pretraining, instruction tuning, and analysis of key influencing factors.
Result: Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
Conclusion: Multi-way parallel data is more effective than unaligned multilingual data for enhancing LLMs’ multilingual capabilities, with TED2025 serving as a valuable resource for cross-lingual model development.
Abstract: Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
[111] DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data
Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang
Main category: cs.CL
TL;DR: DISCO improves GRPO by addressing domain imbalance through domain-aware and difficulty-aware reward scaling, achieving better generalization and fairness in multi-domain LLM alignment.
Details
Motivation: GRPO assumes balanced domain distribution and uniform semantic alignment, which fails in real-world imbalanced datasets, causing poor generalization and fairness by over-optimizing dominant domains.
Method: Domain-Informed Self-Consistency Policy Optimization (DISCO) extends GRPO with domain-aware reward scaling to counteract frequency bias and difficulty-aware reward scaling that prioritizes uncertain prompts using self-consistency.
Result: DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new SOTA results on multi-domain alignment benchmarks across multiple LLMs and skewed distributions.
Conclusion: DISCO provides a principled solution to domain imbalance in RLHF, enabling more equitable and effective policy learning through strategic reward scaling techniques.
Abstract: Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups, assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks. Our code and data are available at https://github.com/Tonyzhou98/disco_grpo.
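The two scalings can be pictured as per-prompt weights on the GRPO objective. A toy version with inverse-frequency domain weights and a self-consistency-based difficulty term; the paper's exact formulas differ:

```python
# Illustrative DISCO-style per-prompt weights: upweight rare domains and
# prompts on which the model's sampled answers disagree.
from collections import Counter

def disco_weights(domains: list[str], self_consistency: list[float]) -> list[float]:
    """domains[i]: domain of prompt i; self_consistency[i]: fraction of
    sampled answers that agree (1.0 = certain, low = uncertain)."""
    freq = Counter(domains)
    n = len(domains)
    weights = []
    for dom, sc in zip(domains, self_consistency):
        domain_w = n / (len(freq) * freq[dom])   # upweight rare domains
        difficulty_w = 1.0 + (1.0 - sc)          # upweight uncertain prompts
        weights.append(domain_w * difficulty_w)
    return weights

w = disco_weights(["math", "math", "math", "safety"], [0.9, 0.8, 0.2, 0.5])
print([round(x, 2) for x in w])  # rare 'safety' and uncertain prompts weigh more
```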
[112] Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Gagan Bhatia, Maxime Peyrard, Wei Zhao
Main category: cs.CL
TL;DR: This paper analyzes how BPE tokenizers fragment calendar dates, introduces a metric for date fragmentation, benchmarks temporal reasoning tasks, and discovers LLMs’ emergent date-abstraction mechanisms.
Details
Motivation: BPE tokenizers often split calendar dates into meaningless fragments, which inflates token counts and obscures temporal structure needed for robust reasoning. This fragmentation particularly affects uncommon dates like historical and futuristic dates.
Method: The authors (1) introduce a date fragmentation ratio metric, (2) create DateAugBench with 6500 examples across three temporal reasoning tasks, and (3) use layer-wise probing and causal attention-hop analyses to study how LLMs process date fragments.
Result: Excessive fragmentation correlates with up to 10-point accuracy drops on uncommon dates. Larger models develop emergent date-abstraction mechanisms faster, and LLMs follow a reasoning path (year→month→day) that differs from human interpretation.
Conclusion: Date fragmentation significantly impacts temporal reasoning performance, but LLMs develop mechanisms to reassemble fragments. The findings highlight tokenization limitations and provide insights into how models learn to handle fragmented temporal information.
Abstract: Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 $\rightarrow$ 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the more quickly it accomplishes the emergent date abstraction that heals date fragments. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year $\rightarrow$ month $\rightarrow$ day). Our datasets and code are made publicly available at https://github.com/gagan3012/date-fragments.
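The fragmentation idea is concrete enough to compute directly. An illustrative version of the ratio, treating a date component as preserved only if it survives as a single token; the paper's precise definition may differ:

```python
# Illustrative date fragmentation ratio: the share of a date's components
# (year, month, day) that the tokenizer splits across token boundaries.

def fragmentation_ratio(tokenize, date: str, components: list[str]) -> float:
    tokens = tokenize(date)
    intact = sum(any(comp == tok for tok in tokens) for comp in components)
    return 1.0 - intact / len(components)

def toy_tokenize(s: str) -> list[str]:
    """Toy BPE stand-in that chunks every 3 characters: 20250312 -> 202|503|12."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]

# ~0.67: only the day component '12' survives as a single token.
print(round(fragmentation_ratio(toy_tokenize, "20250312", ["2025", "03", "12"]), 2))
```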
[113] Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home?
Yujin Choi, Youngjoo Park, Junyoung Byun, Jaewook Lee, Jinseong Park
Main category: cs.CL
TL;DR: A similarity-based detection framework for membership inference attacks in retrieval-augmented generation systems, using a detect-and-hide strategy to protect private documents while maintaining utility.
Details
Motivation: Retrieval-augmented generation (RAG) systems are vulnerable to membership inference attacks (MIAs) that can determine whether specific data exists in private databases, compromising privacy.
Method: Proposes a similarity-based MIA detection framework that identifies attack queries by their high similarity to single target documents, then implements a detect-and-hide strategy to obfuscate attackers.
Result: The method successfully defends against various state-of-the-art MIA techniques, maintains data utility, and adapts to existing RAG systems without requiring system modifications.
Conclusion: The proposed framework provides effective protection against membership inference attacks in RAG systems while preserving functionality and remaining system-agnostic.
Abstract: Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for personalized usages. However, delivering private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target data point exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce a novel similarity-based MIA detection framework designed for the RAG system. With the proposed method, we show that a simple detect-and-hide strategy can successfully obfuscate attackers, maintain data utility, and remain system-agnostic against MIA. We experimentally prove its detection and defense against various state-of-the-art MIA methods and its adaptability to existing RAG systems.
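The detection insight translates into a few lines: flag a query whose top-1 document similarity is both very high and far above the runner-up, and drop that document from the retrieved context. The thresholds below are illustrative, not the paper's tuned values:

```python
# Sketch of the detect-and-hide idea: an MIA query looks abnormally similar to
# exactly one document, so hide that document when the pattern is detected.
import numpy as np

def detect_and_hide(query_emb, doc_embs, tau_top=0.95, tau_gap=0.2):
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb))
    order = np.argsort(sims)[::-1]
    top, second = sims[order[0]], sims[order[1]]
    if top > tau_top and top - second > tau_gap:   # near-duplicate of one doc
        return [int(i) for i in order[1:]]          # hide the matched document
    return [int(i) for i in order]                  # normal retrieval order

rng = np.random.default_rng(1)
docs = rng.normal(size=(5, 8))
attack_query = docs[2] + rng.normal(scale=0.01, size=8)  # near-copy of doc 2
print(detect_and_hide(attack_query, docs)[:3])  # doc 2 is excluded
```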
[114] LASER: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy
Paramita Mirza, Lucas Weber, Fabian Küch
Main category: cs.CL
TL;DR: A multi-step pipeline for efficient and universal data selection that uses binning, quality estimation, and difficulty scoring to enable high-performance fine-tuning with minimal overhead.
Details
Motivation: Data selection for LLMs often incurs high computational costs or is limited to narrow domains, but recent work shows post-training datasets can be substantially downsampled without performance deterioration.
Method: Multi-step pipeline with efficient data binning, quality estimation using specialized models, difficulty scoring with lightweight method, task-based categorization for composition control, and improved diversity using embedding models and clustering algorithms.
Result: The integrated strategy enables high-performance fine-tuning with minimal overhead while maintaining data diversity and controlling composition for multi-purpose models.
Conclusion: Data selection can be both efficient and universal through this multi-step approach, allowing substantial dataset downsampling without compromising performance.
Abstract: Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this paper, we demonstrate that data selection can be both – efficient and universal – by using a multi-step pipeline in which we efficiently bin data points into groups, estimate quality using specialized models, and score difficulty with a robust, lightweight method. Task-based categorization allows us to control the composition of our final data – crucial for finetuning multi-purpose models. To guarantee diversity, we improve upon previous work using embedding models and a clustering algorithm. This integrated strategy enables high-performance fine-tuning with minimal overhead.
[115] Advancing Expert Specialization for Better MoE
Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang
Main category: cs.CL
TL;DR: A method to improve Mixture-of-Experts (MoE) models by introducing orthogonality and variance losses to enhance expert specialization and discriminative routing, achieving up to 23.79% improvement over baseline MoE models.
Details
Motivation: Current MoE models using auxiliary load balancing loss lead to expert overlap and uniform routing, which hinders expert specialization and degrades performance during post-training.
Method: Proposes two complementary objectives: orthogonality loss to encourage experts to process distinct token types, and variance loss to encourage more discriminative routing decisions. These are compatible with existing auxiliary loss.
Result: Experimental results show significant enhancement in expert specialization across various model architectures and benchmarks. Improves classic MoE baselines by up to 23.79% while maintaining load balancing in downstream tasks.
Conclusion: The proposed method effectively addresses expert overlap and uniform routing issues in MoE models without architectural modifications, leading to substantial performance improvements.
Abstract: Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
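The two auxiliary objectives can be pictured in a few lines. This is one plausible instantiation, assuming the orthogonality term acts on per-expert representations and the variance term on routing probabilities; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(expert_reps):
    """expert_reps: (num_experts, d), e.g. a mean representation per expert.
    Penalizing off-diagonal cosine similarity pushes experts apart."""
    e = F.normalize(expert_reps, dim=-1)
    gram = e @ e.t()
    off_diag = gram - torch.eye(e.size(0), device=e.device)
    return off_diag.pow(2).mean()

def variance_loss(router_logits):
    """router_logits: (num_tokens, num_experts). Rewarding high per-token
    variance of routing probabilities makes routing less uniform."""
    probs = router_logits.softmax(dim=-1)
    return -probs.var(dim=-1).mean()

# total = lm_loss + load_balance_aux \
#         + a * orthogonality_loss(reps) + b * variance_loss(logits)
```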
[116] DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
Yize Cheng, Wenxiao Wang, Mazda Moayeri, Soheil Feizi
Main category: cs.CL
TL;DR: DyePack is a framework that uses backdoor attacks to detect test set contamination in large language models, providing provable false positive rate guarantees without requiring access to model internals.
Details
Motivation: Open benchmarks are vulnerable to test set contamination since their accessibility makes them likely targets for improper training data inclusion, compromising evaluation integrity.
Method: The framework mixes backdoor samples with test data using multiple backdoors with stochastic targets, enabling exact false positive rate computation when flagging contaminated models.
Result: Successfully detected all contaminated models with extremely low false positive rates: 0.000073% on MMLU-Pro, 0.000017% on Big-Bench-Hard (multiple-choice), and 0.127% on Alpaca (open-ended generation).
Conclusion: DyePack provides a practical and provably reliable method for detecting benchmark contamination, ensuring evaluation integrity while preventing false accusations.
Abstract: Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
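The exact-FPR guarantee is easy to reproduce under a simple reading: if each of k backdoors has its target drawn uniformly from m candidates, a clean model matches each by chance with probability 1/m, and flagging at a threshold t gives a binomial tail. The numbers below are illustrative, not the paper's:

```python
from math import comb

def flag_fpr(k: int, m: int, t: int) -> float:
    """FPR of flagging a clean model that matches >= t of k backdoors,
    each with a target chosen uniformly from m options."""
    p = 1.0 / m
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(t, k + 1))

print(flag_fpr(k=8, m=4, t=8))   # (1/4)^8 ~ 1.5e-05, i.e. ~0.0015%
```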
[117] Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation
Beiduo Chen, Yang Janet Liu, Anna Korhonen, Barbara Plank
Main category: cs.CL
TL;DR: Proposes a novel LLM-based pipeline using linguistically-grounded discourse segmenters to extract supporting/opposing statements from chains of thought (CoTs) for human label variation analysis, with a rank-based evaluation framework.
Details
Motivation: To better understand human label variation by leveraging reasoning-tuned LLMs' chain of thought capabilities, which provide forward reasoning paths that implicitly embed rationales for each answer option before generating final answers.
Method: Uses linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from LLM-generated chains of thought, and proposes a rank-based evaluation framework that prioritizes answer ranking over exact scores.
Result: Outperforms direct generation methods and baselines on three datasets, showing better alignment of ranking methods with human judgments.
Conclusion: The approach effectively leverages CoTs for human label variation analysis, demonstrating improved accuracy and better human alignment compared to traditional methods.
Abstract: The recent rise of reasoning-tuned Large Language Models (LLMs)–which generate chains of thought (CoTs) before giving the final answer–has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopt a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.
[118] LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation
Chaeeun Kim, Jinu Lee, Wonseok Hwang
Main category: cs.CL
TL;DR: The paper introduces LEGAR BENCH, a large-scale Korean Legal Case Retrieval benchmark, and LegalSearchLM, a retrieval model that performs legal element reasoning and constrained decoding to improve case retrieval performance.
Details
Motivation: Existing Legal Case Retrieval (LCR) studies face limitations: small-scale corpora (100-55K cases), narrow criminal query types, and reliance on embedding-based/lexical matching methods that produce limited representations and legally irrelevant matches.
Method: LegalSearchLM performs legal element reasoning over query cases and directly generates content containing those elements, grounded in target cases through constrained decoding. The model is evaluated on LEGAR BENCH, covering 411 crime types over 1.2M candidate cases.
Result: LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.
Conclusion: The proposed LegalSearchLM model with legal element reasoning and constrained decoding effectively addresses limitations of existing LCR methods, showing significant performance improvements and strong generalization capabilities on the large-scale LEGAR BENCH.
Abstract: Legal Case Retrieval (LCR), which retrieves relevant cases from a query case, is a fundamental task for legal professionals in research and decision-making. However, existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and use a narrow range of criminal query types, which cannot sufficiently reflect the complexity of real-world legal retrieval scenarios. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches. To address these issues, we present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering 411 diverse crime types in queries over 1.2M candidate cases; and (2) LegalSearchLM, a retrieval model that performs legal element reasoning over the query case and directly generates content containing those elements, grounded in the target cases through constrained decoding. Experimental results show that LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.
[119] RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing
Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuai Zhang, Jianhua Tao
Main category: cs.CL
TL;DR: RadialRouter is a novel LLM routing framework that uses a lightweight Transformer-based backbone with radial structure to better connect user queries with LLM characteristics, achieving significant performance improvements over existing methods.
Details
Motivation: Current LLM routing methods are limited due to insufficient exploration of the intrinsic connection between user queries and LLM characteristics, leading to suboptimal model selection for specific tasks.
Method: Proposes RadialRouter framework with RadialFormer (lightweight Transformer-based backbone with radial structure) to articulate query-LLMs relationship, using KL divergence combined with query-query contrastive loss for robust optimization.
Result: Outperforms existing routing methods by 9.2% in Balance scenario and 5.8% in Cost First scenario on RouterBench, demonstrating strong adaptability to different performance-cost trade-offs and dynamic LLM pools.
Conclusion: RadialRouter provides an effective solution for LLM routing that significantly improves performance while maintaining practical application potential through its adaptability and robustness.
Abstract: The rapid advancements in large language models (LLMs) have led to the emergence of routing techniques, which aim to efficiently select the optimal LLM from diverse candidates to tackle specific tasks, optimizing performance while reducing costs. Current LLM routing methods are limited in effectiveness due to insufficient exploration of the intrinsic connection between user queries and the characteristics of LLMs. To address this issue, in this paper, we present RadialRouter, a novel framework for LLM routing which employs a lightweight Transformer-based backbone with a radial structure named RadialFormer to articulate the query-LLMs relationship. The optimal LLM selection is performed based on the final states of RadialFormer. The pipeline is further refined by an objective function that combines Kullback-Leibler divergence with the query-query contrastive loss to enhance robustness. Experimental results on RouterBench show that RadialRouter significantly outperforms existing routing methods by 9.2% and 5.8% in the Balance and Cost First scenarios, respectively. Additionally, its adaptability toward different performance-cost trade-offs and the dynamic LLM pool demonstrates practical application potential.
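A sketch of the combined objective. The contrastive term is written here as a standard InfoNCE over two views of each query, which is an assumption about how the query-query loss is instantiated:

```python
import torch
import torch.nn.functional as F

def router_objective(pred_logits, target_scores, q1, q2, tau=0.07):
    """KL between the router's predicted LLM-suitability distribution and a
    target distribution, plus a query-query contrastive term that pulls two
    views (q1, q2) of the same batch of queries together."""
    kl = F.kl_div(pred_logits.log_softmax(-1),
                  target_scores.softmax(-1), reduction="batchmean")
    z1, z2 = F.normalize(q1, dim=-1), F.normalize(q2, dim=-1)
    logits = z1 @ z2.t() / tau                    # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return kl + F.cross_entropy(logits, labels)   # matched pairs on diagonal
```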
[120] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva
Main category: cs.CL
TL;DR: Models can identify unhelpful thoughts but struggle to recover from them when injected into reasoning, showing limitations in self-reevaluation capabilities.
Details
Motivation: To investigate how effectively reasoning models can perform self-reevaluation by identifying and recovering from four types of unhelpful thoughts.
Method: Injecting four types of unhelpful thoughts (rambling, irrelevant, misdirecting, incorrect) into models’ reasoning processes and evaluating their ability to identify and recover from them.
Result: Models effectively identify most unhelpful thoughts but perform poorly at recovery, with larger models struggling more than smaller ones. Inverse scaling observed where larger models are more distracted by irrelevant thoughts.
Conclusion: Current reasoning models lack robust self-reevaluation capabilities, calling for improvements to develop better reasoning and safer AI systems.
Abstract: Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general “meta-cognitive” awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.
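The injection protocol is straightforward to replicate: prefill the model's reasoning segment with an unhelpful thought and let generation continue. A schematic sketch; the chat template and think-delimiters below are model-specific assumptions:

```python
def build_injected_prompt(question: str, unhelpful_thought: str) -> str:
    """Prefill the reasoning segment with a rambling, irrelevant,
    misdirecting, or incorrect thought; recovery means the model
    discards it and still answers correctly."""
    return (
        f"User: {question}\n"
        f"Assistant: <think>\n{unhelpful_thought}\n"
        # generation resumes from here, inside the injected thought block
    )
```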
[121] Augmenting Multi-Agent Communication with State Delta Trajectory
Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai
Main category: cs.CL
TL;DR: Proposes State Delta Encoding (SDE) to improve multi-agent LLM communication by transferring both natural language tokens and token-wise state transition trajectories, achieving SOTA performance in complex reasoning tasks.
Details
Motivation: Existing multi-agent systems using natural language communication suffer from information loss when transferring reasoning logics or abstract thoughts, as continuous state vectors must be down-sampled to discrete tokens.
Method: Introduces a new communication protocol that transfers natural language tokens along with token-wise state transition trajectories using State Delta Encoding (SDE), which captures state changes after each token generation to reveal hidden inference process information.
Result: Multi-agent systems with SDE achieve state-of-the-art performance compared to other communication protocols, particularly excelling in tasks involving complex reasoning.
Conclusion: The proposed SDE method effectively addresses information loss in multi-agent communication by preserving state transition trajectories, leading to superior performance in complex reasoning scenarios.
Abstract: Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing multi-agent systems constructed from a single base LLM mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss, as one model must downsample its continuous state vectors to discrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logics or abstract thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and token-wise state transition trajectories from one agent to another. In particular, compared to the actual state values, we find that the sequence of state changes in LLMs after generating each token can better reflect the information hidden behind the inference process. We propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning.
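What a state-delta trajectory might look like for a HuggingFace-style model; the choice of layer and the packaging of the trajectory are assumptions on top of the abstract:

```python
import torch

@torch.no_grad()
def state_delta_trajectory(model, input_ids):
    """Sequence of last-layer hidden-state changes after each token; under
    SDE this trajectory is transferred alongside the generated text."""
    out = model(input_ids, output_hidden_states=True)
    h = out.hidden_states[-1][0]     # (seq_len, d), batch element 0
    return h[1:] - h[:-1]            # delta after producing each token
```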
[122] The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure
Yu Fan, Yang Tian, Shauli Ravfogel, Mrinmaya Sachan, Elliott Ash, Alexander Hoyle
Main category: cs.CL
TL;DR: A debiasing algorithm that removes information about observed confounders from encoder representations substantially reduces biases in text embeddings at minimal computational cost, improving similarity and clustering metrics without degrading out-of-distribution performance.
Details
Motivation: Embedding-based similarity metrics can be biased by spurious attributes like text source or language, which are problematic for applications that need to pool texts from different corpora.
Method: A debiasing algorithm that removes information about observed confounders from the encoder representations.
Result: Substantial reduction in biases across every embedding variant and task evaluated, with improved document similarity and clustering metrics. Out-of-distribution benchmark performance remains unaffected.
Conclusion: The debiasing approach effectively reduces confounder-induced biases in text embeddings while maintaining overall embedding quality, making it valuable for applications involving pooled text corpora.
Abstract: Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text’s source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate – often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.
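For a binary confounder such as corpus source, the simplest linear erasure projects out the direction separating the two groups. A one-direction sketch; dedicated methods such as INLP or LEACE do this iteratively or optimally:

```python
import numpy as np

def erase_binary_confounder(X, z):
    """X: (n, d) embeddings; z: (n,) binary confounder labels.
    Removes the group mean-difference direction from every embedding."""
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    d /= np.linalg.norm(d)
    return X - np.outer(X @ d, d)   # rank-1 projection onto d's null space
```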
[123] Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach
Keshav Kumar
Main category: cs.CL
TL;DR: A reference-free, token-level hallucination detection framework that uses variance in token log-probabilities across multiple stochastic generations to identify hallucinations in LLMs.
Details
Motivation: LLMs generate impressive outputs but are susceptible to hallucinations: factually incorrect but confidently generated responses. Existing methods require ground-truth references or sentence-level verification, which limits their applicability.
Method: Leverages variance in token log-probabilities across multiple stochastic generations from the same model. The approach is model-agnostic, interpretable, and works for real-time or post-hoc analysis without needing reference data.
Result: Evaluated on unanswerable question prompts from SQuAD v2 dataset across three models (GPT-Neo 125M, Falcon 1B, Mistral 7B). Token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns through quantitative metrics and visual diagnostics.
Conclusion: The framework is lightweight, reproducible, adaptable to multiple domains, and offers a valuable diagnostic tool for analyzing generative reliability in LLMs without requiring ground-truth references.
Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
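The core signal reduces to a per-position variance over sampled generations. A sketch that truncates the k runs to a common length for simplicity (the paper may align tokens differently, and the flagging threshold is illustrative):

```python
import numpy as np

def variance_flags(logprob_runs, threshold=1.0):
    """logprob_runs: list of per-token log-prob arrays, one per stochastic
    generation of the same prompt. High positional variance is read as
    instability, i.e. a hallucination signal."""
    n = min(len(r) for r in logprob_runs)
    runs = np.stack([np.asarray(r[:n]) for r in logprob_runs])
    var = runs.var(axis=0)          # variance across generations
    return var, var > threshold     # per-position flags
```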
[124] VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua
Main category: cs.CL
TL;DR: VisualTrap is a backdoor attack method that exploits vulnerabilities in GUI agents’ visual grounding by misleading them to misinterpret textual plans and click on trigger locations instead of intended targets, requiring only 5% poisoned data and remaining effective across different GUI environments.
Details
Motivation: GUI agents powered by Large Vision-Language Models are increasingly used for automating human-machine interactions, but their integration with personal devices raises significant security concerns, particularly unexplored backdoor attack vulnerabilities in the visual grounding process.
Method: VisualTrap injects poisoned data during pre-training of visual grounding to hijack the agent’s ability to map textual plans to GUI elements. It uses stealthy visual triggers invisible to humans and requires minimal poisoned data (as low as 5%).
Result: The attack effectively compromises agent behavior even with correct task-solving plans, generalizes to downstream tasks after clean fine-tuning, and transfers across different GUI environments (mobile/web to desktop).
Conclusion: The findings reveal critical security vulnerabilities in GUI agents’ visual grounding and underscore the urgent need for research on backdoor attack risks in these systems.
Abstract: Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents, which maps textual plans to GUI elements, can introduce vulnerabilities, enabling new types of backdoor attacks. With a backdoor attack targeting visual grounding, the agent’s behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans at trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure the practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.
[125] Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation
Jialong Mai, Xiaofen Xing, Yawei Li, Weidong Chen, Zhipeng Li, Jingyuan Xing, Xiangmin Xu
Main category: cs.CL
TL;DR: Proposes Dynamic Parameter Memory (DPM) mechanism to enable SLLMs to process unlimited-length audio by progressively encoding sentence-level information and emotions into temporary LoRA modules during inference.
Details
Motivation: Current SLLMs have limited context windows that restrict audio processing capabilities (e.g., a 4K context window only handles 80 seconds at 50Hz), and existing compression methods ignore emotion continuity across conversation turns.
Method: A DPM mechanism with contextual semantics and sentence-level emotion encoding that progressively stores information in temporary LoRA modules during inference, trained on an emotion SLLM backbone for ERC tasks.
Result: Experimental results on IEMOCAP dataset show DPM significantly improves emotion recognition for long audio sequences, achieving state-of-the-art performance.
Conclusion: DPM effectively enables SLLMs to handle unlimited-length audio with limited context windows by memorizing contextual information through progressive encoding.
Abstract: Recent research has focused on applying speech large language model (SLLM) to improve speech emotion recognition (SER). However, the inherently high frame rate in speech modality severely limits the signal processing and understanding capabilities of SLLM. For example, a SLLM with a 4K context window can only process 80 seconds of audio at 50Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLM overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLM. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively “memorize” the contextual information. We trained an emotion SLLM as a backbone and incorporated our DPM into inference for emotion recognition in conversation (ERC). Experimental results on the IEMOCAP dataset show that DPM significantly improves the emotion recognition capabilities of SLLM when processing long audio sequences, achieving state-of-the-art performance.
[126] Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning
Benedikt Roth, Stephan Rappensperger, Tianming Qiu, Hamza Imamović, Julian Wörmann, Hao Shen
Main category: cs.CL
TL;DR: LLMs can be effectively adapted for text embedding tasks through prompt engineering and contrastive fine-tuning, achieving competitive performance on MTEB clustering benchmark.
Details
Motivation: LLMs have rich token-level semantics but pooling into embeddings loses crucial information, while many downstream tasks need accurate sentence/document embeddings.
Method: Three adaptation strategies: (i) token embedding aggregation techniques, (ii) task-specific prompt engineering, (iii) contrastive fine-tuning with synthetic positive pairs.
Result: Competitive performance on MTEB English clustering track; attention analysis shows fine-tuning shifts focus from prompt tokens to semantically relevant words.
Conclusion: LLMs can be effectively adapted as text embedding models through prompt engineering and resource-efficient contrastive fine-tuning.
Abstract: Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation. Their token-level representations capture rich, human-aligned semantics. However, pooling these vectors into a text embedding discards crucial information. Nevertheless, many non-generative downstream tasks, such as clustering, classification, or retrieval, still depend on accurate and controllable sentence- or document-level embeddings. We explore several adaptation strategies for pre-trained, decoder-only LLMs: (i) various aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) text-level augmentation via contrastive fine-tuning. Combining these components yields competitive performance on the English clustering track of the Massive Text Embedding Benchmark (MTEB). An analysis of the attention map further shows that fine-tuning shifts focus from prompt tokens to semantically relevant words, indicating more effective compression of meaning into the final hidden state. Our experiments demonstrate that LLMs can be effectively adapted as text embedding models through a combination of prompt engineering and resource-efficient contrastive fine-tuning on synthetically generated positive pairs.
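Two of the aggregation strategies are easy to picture: mean pooling over non-padding tokens versus reading off the last token's state, which is natural for causal LMs. A sketch:

```python
import torch

def pool_embeddings(hidden, attention_mask, how="mean"):
    """hidden: (batch, seq, d) last hidden states of a decoder-only LM;
    attention_mask: (batch, seq) with 1 on real tokens. Returns (batch, d)."""
    if how == "mean":
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
    idx = attention_mask.sum(1) - 1                  # last real token per row
    return hidden[torch.arange(hidden.size(0)), idx]
```

Task-specific prompts (e.g., asking the model to compress a sentence's meaning into one word before reading off the final state) supply the prompt-engineering component; the paper's exact prompts are not quoted here.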
[127] Enhancing RAG Efficiency with Adaptive Context Compression
Shuyu Guo, Shuo Zhang, Zhaochun Ren
Main category: cs.CL
TL;DR: ACC-RAG is an adaptive context compression framework for retrieval-augmented generation that dynamically adjusts compression rates based on query complexity to optimize inference efficiency while maintaining accuracy.
Details
Motivation: Standard RAG incurs high inference costs from lengthy retrieved contexts, and existing compression methods use fixed rates that either over-compress simple queries or under-compress complex ones.
Method: Combines a hierarchical compressor for multi-granular embeddings with a context selector to retain minimal sufficient information, similar to human skimming behavior.
Result: Outperforms fixed-rate compression methods and achieves over 4x faster inference than standard RAG while maintaining or improving accuracy on Wikipedia and five QA datasets.
Conclusion: ACC-RAG provides an effective solution for optimizing RAG inference efficiency through adaptive compression that matches query complexity, enabling significant speedups without accuracy loss.
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and achieves over 4 times faster inference than standard RAG while maintaining or improving accuracy.
[128] From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs
Haonan Bian, Yutao Qi, Rui Yang, Yuanxi Che, Jiaqian Wang, Heming Xia, Ranran Zhen
Main category: cs.CL
TL;DR: ORACLE is a training-free framework that enhances LLMs’ multi-hop question answering by combining generative capabilities with knowledge graph structures through dynamic ontology construction, First-Order Logic reasoning chains, and systematic query decomposition.
Details
Motivation: LLMs struggle with complex multi-hop question answering tasks due to their inability to capture deep conceptual relationships between entities, requiring non-linear, structured reasoning.
Method: Three-stage approach: (1) dynamic construction of question-specific knowledge ontologies using LLMs, (2) transformation into First-Order Logic reasoning chains, (3) systematic decomposition of original queries into logically coherent sub-questions.
Result: Achieves highly competitive performance on standard MQA benchmarks, rivaling state-of-the-art models like DeepSeek-R1, with more logical and interpretable reasoning chains.
Conclusion: ORACLE effectively bridges LLMs’ generative strengths with structured reasoning, demonstrating significant improvements in multi-hop question answering through interpretable logical frameworks.
Abstract: Large Language Models (LLMs), despite their success in question answering, exhibit limitations in complex multi-hop question answering (MQA) tasks that necessitate non-linear, structured reasoning. This limitation stems from their inability to adequately capture deep conceptual relationships between entities. To overcome this challenge, we present ORACLE (Ontology-driven Reasoning And Chain for Logical Elucidation), a training-free framework that combines LLMs’ generative capabilities with the structural benefits of knowledge graphs. Our approach operates through three stages: (1) dynamic construction of question-specific knowledge ontologies using LLMs, (2) transformation of these ontologies into First-Order Logic reasoning chains, and (3) systematic decomposition of the original query into logically coherent sub-questions. Experimental results on several standard MQA benchmarks show that our framework achieves highly competitive performance, rivaling current state-of-the-art models like DeepSeek-R1. Detailed analyses further confirm the effectiveness of each component, while demonstrating that our method generates more logical and interpretable reasoning chains than existing approaches.
[129] SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs
Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu
Main category: cs.CL
TL;DR: SciRerankBench is the first benchmark specifically designed to evaluate rerankers within RAG-LLM systems for scientific literature question answering, covering five scientific subjects and testing noise resilience, relevance disambiguation, and factual consistency.
Details
Motivation: Two-stage RAG-LLMs have shown impressive advancements in scientific QA, but their rerankers' potential and limitations remain unexplored, especially given that subtle terminology differences can greatly impact factual accuracy in scientific domains.
Method: Developed SciRerankBench with three types of Q-C-A pairs (Noisy Contexts, Semantically Similar but Logically Irrelevant Contexts, Counterfactual Contexts) to systematically evaluate 13 rerankers across five LLM families on five scientific subjects.
Result: The benchmark provides detailed insights into rerankers’ relative strengths and limitations through systematic evaluation, though specific quantitative results are not provided in the abstract.
Conclusion: SciRerankBench is the first specialized benchmark for evaluating rerankers in RAG-LLMs, offering valuable observations and guidance for future development in scientific question answering systems.
Abstract: Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, \textit{two-stage} retrieval-augmented generated large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology may have a greatly negative impact on the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench), for evaluating rerankers within RAG-LLMs systems, spanning five scientific subjects. To rigorously assess the reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development.
[130] Culture is Everywhere: A Call for Intentionally Cultural Evaluation
Juhyun Oh, Inha Cha, Michael Saxon, Hyunseung Lim, Shaily Bhatt, Alice Oh
Main category: cs.CL
TL;DR: The paper critiques current cultural evaluation methods for LLMs as inadequate and proposes a new approach called “intentionally cultural evaluation” that examines cultural assumptions in all aspects of evaluation, not just explicitly cultural tasks.
Details
Motivation: Current evaluation methods reduce culture to static facts or values, treating it as isolated trivia through multiple-choice questions, which neglects the pluralistic and interactive nature of culture and overlooks how cultural assumptions permeate even "neutral" evaluation settings.
Method: The authors systematically characterize the what, how, and circumstances of culturally contingent considerations in evaluation, emphasizing researcher positionality and proposing HCI-inspired participatory methodologies for involving communities in evaluation design.
Result: The paper provides a framework for moving beyond current benchmarking practices and discovering important applications that may not yet be recognized, advocating for more inclusive and culturally aligned NLP research.
Conclusion: The position paper argues for a paradigm shift towards intentionally cultural evaluation that systematically examines embedded cultural assumptions throughout the evaluation process, rather than treating culture as isolated trivia.
Abstract: The prevailing “trivia-centered paradigm” for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly “neutral” evaluation settings. In this position paper, we argue for intentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize the what, how, and circumstances by which culturally contingent considerations arise in evaluation, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don’t know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.
[131] Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader
Jannis Vamvas, Ignacio Pérez Prat, Not Battesta Soliva, Sandra Baltermia-Guetg, Andrina Beeli, Simona Beeli, Madlaina Capeder, Laura Decurtins, Gian Peder Gregori, Flavia Hobi, Gabriela Holderegger, Arina Lazzarini, Viviana Lazzarini, Walter Rosselli, Bettina Vital, Anna Rutkiewicz, Rico Sennrich
Main category: cs.CL
TL;DR: A benchmark for machine translation evaluation of six Romansh language varieties was created using human translations from the WMT24++ benchmark, showing that translation from Romansh to German works well but translation into Romansh remains challenging.
Details
Motivation: The Romansh language spoken in Switzerland has limited resources for machine translation evaluation, creating a need for standardized benchmarks to assess translation quality across its six varieties.
Method: Created reference translations by human translators based on the WMT24++ benchmark, ensuring parallelism with over 55 other languages, then conducted automatic evaluation of existing MT systems and LLMs.
Result: Translation out of Romansh into German is handled relatively well for all six varieties, but translation into Romansh is still challenging for current systems.
Conclusion: The benchmark provides essential resources for evaluating Romansh machine translation, highlighting the ongoing difficulty of translating into Romansh compared to translating from Romansh.
Abstract: The Romansh language, spoken in Switzerland, has limited resources for machine translation evaluation. In this paper, we present a benchmark for six varieties of Romansh: Rumantsch Grischun, a supra-regional variety, and five regional varieties: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. Our reference translations were created by human translators based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages. An automatic evaluation of existing MT systems and LLMs shows that translation out of Romansh into German is handled relatively well for all the varieties, but translation into Romansh is still challenging.
[132] No Encore: Unlearning as Opt-Out in Music Generation
Jinju Kim, Taehan Kim, Abdul Waheed, Jong Hwan, Rita Singh
Main category: cs.CL
TL;DR: This paper applies machine unlearning techniques to text-to-music models to prevent inadvertent usage of copyrighted content while maintaining model performance.
Details
Motivation: AI music generation systems pose risks of exploiting copyrighted creations, raising ethical and legal concerns that need to be addressed.
Method: The researchers explore existing machine unlearning methods applied to a pre-trained Text-to-Music baseline, analyzing their efficacy in unlearning pre-trained datasets.
Result: Preliminary results provide insights into the challenges of applying unlearning in music generation.
Conclusion: The study offers a foundational analysis for future works on applying unlearning techniques to music generative models to address copyright concerns.
Abstract: AI music generation is rapidly emerging in the creative industries, enabling intuitive music generation from textual descriptions. However, these systems pose risks of exploiting copyrighted creations, raising ethical and legal concerns. In this paper, we present preliminary results on the first application of machine unlearning techniques, from ongoing research, to prevent inadvertent usage of creative content. In particular, we apply existing machine unlearning methods to a pre-trained Text-to-Music (TTM) baseline and analyze their efficacy in unlearning pre-trained datasets without harming model performance. Through our experiments, we provide insights into the challenges of applying unlearning in music generation, offering a foundational analysis for future works on the application of unlearning for music generative models.
[133] Interdisciplinary Research in Conversation: A Case Study in Computational Morphology for Language Documentation
Enora Rice, Katharina von der Wense, Alexis Palmer
Main category: cs.CL
TL;DR: Computational morphology tools have limited real-world adoption in language documentation due to misalignment between NLP research and practice. The paper argues for User-Centered Design to bridge this gap.
Details
Motivation: To address the disconnect between computational morphology research outputs and their practical application in language documentation settings, highlighting the risk of NLP becoming decontextualized without systematic UCD integration.
Method: Presents a case study of GlossLM (a state-of-the-art multilingual IGT generation model) through a small-scale user study with three documentary linguists to evaluate real-world usability despite strong metric-based performance.
Result: The study found that despite strong performance metrics, GlossLM failed to meet core usability needs in real documentation contexts, revealing issues with model constraints, label standardization, segmentation, and personalization.
Conclusion: Centering users through UCD not only produces more effective tools but also surfaces richer, more relevant research directions for computational morphology in language documentation.
Abstract: Computational morphology has the potential to support language documentation through tasks like morphological segmentation and the generation of Interlinear Glossed Text (IGT). However, our research outputs have seen limited use in real-world language documentation settings. This position paper situates the disconnect between computational morphology and language documentation within a broader misalignment between research and practice in NLP and argues that the field risks becoming decontextualized and ineffectual without systematic integration of User-Centered Design (UCD). To demonstrate how principles from UCD can reshape the research agenda, we present a case study of GlossLM, a state-of-the-art multilingual IGT generation model. Through a small-scale user study with three documentary linguists, we find that despite strong metric-based performance, the system fails to meet core usability needs in real documentation contexts. These insights raise new research questions around model constraints, label standardization, segmentation, and personalization. We argue that centering users not only produces more effective tools, but also surfaces richer, more relevant research directions.
[134] Synthetic bootstrapped pretraining
Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang
Main category: cs.CL
TL;DR: Synthetic Bootstrapped Pretraining (SBP) is a method that learns inter-document relations to synthesize new training data, improving language model performance beyond standard token-level pretraining.
Details
Motivation: Standard LM pretraining only models token-level correlations within single documents, missing the rich inter-document correlations that could lead to better performance. SBP aims to efficiently capture these learnable relationships between documents.
Method: SBP first learns a model of relations between documents from the pretraining dataset, then uses this model to synthesize a vast new corpus for joint training. The synthesizer abstracts core concepts from seed material and creates new narratives.
Result: A 3B-parameter model pretrained with SBP on up to 1T tokens consistently outperforms repetition baselines and achieves significant performance improvement comparable to an oracle with 20x more unique data. Qualitative analysis shows synthesized documents go beyond paraphrasing.
Conclusion: SBP effectively captures latent concepts shared between related documents, providing strong empirical performance improvements while admitting a natural Bayesian interpretation of learning document abstractions.
Abstract: We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretrain a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases – SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
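The first SBP step presupposes mining related document pairs from the corpus to train the seed-to-new-document synthesizer. A nearest-neighbour sketch, assuming unit-normalized embeddings; the paper's actual pairing procedure may differ:

```python
import numpy as np

def mine_synthesis_pairs(doc_embs, docs, sim_floor=0.6):
    """doc_embs: (n, d) unit-normalized embeddings. Each document's nearest
    neighbour becomes a (seed, target) example for synthesizer training."""
    sims = doc_embs @ doc_embs.T
    np.fill_diagonal(sims, -1.0)          # exclude self-matches
    pairs = []
    for i, doc in enumerate(docs):
        j = int(sims[i].argmax())
        if sims[i, j] >= sim_floor:       # keep only clearly related pairs
            pairs.append((doc, docs[j]))
    return pairs
```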
[135] Frustratingly Easy Data Augmentation for Low-Resource ASR
Katsumi Ibaraki, David Chiang
Main category: cs.CL
TL;DR: Three self-contained data augmentation methods for low-resource ASR using text generation and TTS, showing significant WER reductions across multiple languages.
Details
Motivation: To address the challenge of limited training data in low-resource Automatic Speech Recognition (ASR) systems by creating synthetic audio data from text augmentation.
Method: Three text augmentation techniques (gloss-based replacement, random replacement, LLM-based approach) followed by Text-to-Speech (TTS) to generate synthetic audio, applied to pretrained Wav2Vec2-XLSR-53 model fine-tuning.
Result: Significant performance gains including 14.3% absolute WER reduction for Nashta, effective across all four low-resource languages (Vatlongos, Nashta, Shinekhen Buryat, Kakabe) and utility for high-resource languages like English.
Conclusion: The proposed data augmentation methods are broadly applicable and effective for improving ASR performance in both low-resource and high-resource language scenarios.
Abstract: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text, using gloss-based replacement, random replacement, or an LLM-based approach, and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
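The random-replacement generator, the simplest of the three, can be sketched directly; any TTS system would then turn each new sentence into synthetic audio. Function name and parameters here are illustrative:

```python
import random

def random_replacement(sentences, n_new=100, seed=0):
    """Create novel sentences from the annotated corpus by swapping a random
    word for another corpus word; gloss-based and LLM-based generation are
    the paper's alternative strategies."""
    rng = random.Random(seed)
    vocab = [w for s in sentences for w in s.split()]
    out = []
    for _ in range(n_new):
        words = rng.choice(sentences).split()
        words[rng.randrange(len(words))] = rng.choice(vocab)
        out.append(" ".join(words))
    return out
```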
[136] Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data
Qiongqiong Wang, Hardik Bhupendra Sailor, Tianchi Liu, Wenyu Zhang, Muhammad Huzaifah, Nattadaporn Lertcheva, Shuo Sun, Nancy F. Chen, Jinyang Wu, AiTi Aw
Main category: cs.CL
TL;DR: CP-Bench is a new benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning, addressing the gap in understanding non-verbal cues like emotion and prosody.
Details
Motivation: Current speech-LLMs excel at transcription and translation but lack understanding of paralinguistic aspects crucial for social and emotional intelligence.
Method: Created two curated question answering datasets requiring linguistic and empathetic understanding, evaluated state-of-the-art speech-LLMs from open and closed-source models, and analyzed performance across question types with temperature tuning experiments.
Result: The benchmark reveals significant gaps in existing evaluations and provides insights into model performance on paralinguistic reasoning tasks.
Conclusion: CP-Bench identifies key limitations in current speech-LLMs and offers guidance for developing more context-aware and emotionally intelligent speech-capable language models.
Abstract: Recent speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues like emotion and prosody. The benchmark includes two curated question answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art speech-LLMs from both open and closed-source models and perform a comprehensive analysis across different question types. The top two models were further analyzed under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
[137] Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning
Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan
Main category: cs.CL
TL;DR: ICRL is a training-free framework that enables LLMs to use non-text modality representations from foundational models through in-context learning, replacing text inputs with FM representations for multi-modal inference.
Details
Motivation: Existing approaches for integrating non-text modality representations into LLMs require costly supervised training, limiting on-the-fly adaptation to new domains and modalities.
Method: Proposes In-Context Representation Learning (ICRL) which maps FM representations into LLMs without training, using few-shot learning where text inputs are replaced with FM representations.
Result: Evaluated on molecular domain tasks, investigating mapping methods, performance factors, and underlying mechanisms of ICRL effectiveness.
Conclusion: ICRL presents the first training-free framework for integrating non-text modality representations into text-based LLMs, offering promising direction for adaptable multi-modal generalization.
Abstract: The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.
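The core ICRL move, replacing the text input of each in-context example with a foundational-model representation, can be pictured with a simple serialization sketch (the vector-to-text mapping below is one hypothetical choice; the paper investigates several mapping methods):

```python
import numpy as np

def icrl_prompt(examples, query_repr, precision=3):
    """Build a few-shot prompt whose inputs are FM representations
    rather than text (hypothetical serialization)."""
    lines = []
    for repr_vec, label in examples:
        lines.append(f"Representation: {np.round(repr_vec, precision).tolist()}\nLabel: {label}")
    lines.append(f"Representation: {np.round(query_repr, precision).tolist()}\nLabel:")
    return "\n\n".join(lines)

# Toy molecular-property shots with 4-d stand-in FM embeddings.
shots = [(np.array([0.12, -0.83, 0.45, 0.09]), "soluble"),
         (np.array([-0.51, 0.33, -0.27, 0.71]), "insoluble")]
print(icrl_prompt(shots, np.array([0.05, -0.60, 0.40, 0.10])))
```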
[138] When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models
Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen
Main category: cs.CL
TL;DR: Long-context SFT improves short-context performance, contrary to long-context pretraining effects. Both MHA and FFN components benefit independently, with long-context SFT promoting contextual knowledge while short-context SFT favors parametric knowledge. Hybrid training mitigates this bias.
Details
Motivation: To understand how SFT data length influences LLM behavior on short-context tasks, as the effects of long-context pretraining are well-studied but SFT implications remain unclear.Method: Systematically investigate SFT data length effects by decoupling and analyzing Multi-Head Attention (MHA) and Feed-Forward Network (FFN) components, studying their interaction, and testing hybrid training approaches.
Result: Long-context SFT improves short-context performance, with both MHA and FFN benefiting independently. Long-context SFT promotes contextual knowledge while short-context SFT favors parametric knowledge, creating a knowledge preference bias.
Conclusion: Hybrid training mitigates the knowledge preference bias and offers explainable guidance for fine-tuning LLMs, showing that exclusive reliance on long-context SFT is suboptimal.
Abstract: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.
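The hybrid-training remedy amounts to mixing long- and short-context examples during SFT. A minimal sketch, assuming a simple per-example Bernoulli mixing rule (the ratio and batching scheme are hypothetical):

```python
import random

def hybrid_sft_batches(short_data, long_data, long_ratio=0.5,
                       batch_size=8, seed=0):
    """Interleave short- and long-context SFT examples so neither
    parametric nor contextual knowledge dominates (hypothetical
    mixing rule)."""
    rng = random.Random(seed)
    while True:
        yield [rng.choice(long_data) if rng.random() < long_ratio
               else rng.choice(short_data)
               for _ in range(batch_size)]

batches = hybrid_sft_batches(["short example"] * 4, ["long example"] * 4)
print(next(batches))
```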
[139] MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction
Liting Zhang, Shiwan Zhao, Aobo Kong, Qicheng Li
Main category: cs.CL
TL;DR: MAPEX is a novel multi-agent framework for keyphrase extraction that dynamically adapts to document length through dual-path strategies, outperforming state-of-the-art methods.
Details
Motivation: Existing unsupervised prompt-based methods for LLMs use uniform prompting regardless of document length or LLM backbone, limiting their ability to fully exploit LLMs' reasoning capabilities for complex keyphrase extraction tasks.Method: MAPEX introduces multi-agent collaboration with modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. It uses a dual-path strategy: knowledge-driven extraction for short texts and topic-guided extraction for long texts.
Result: Extensive experiments on six benchmark datasets across three LLMs show MAPEX outperforms state-of-the-art unsupervised methods by 2.44% and standard LLM baselines by 4.01% in F1@5 on average.
Conclusion: MAPEX demonstrates strong generalization and universality, proving that multi-agent collaboration effectively enhances keyphrase extraction performance across diverse scenarios and document lengths.
Abstract: Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt-based methods for Large Language Models (LLMs) often rely on single-stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one-size-fits-all designs hinder the full exploitation of LLMs’ reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi-agent collaboration into keyphrase extraction. MAPEX coordinates LLM-based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. A dual-path strategy dynamically adapts to document length: knowledge-driven extraction for short texts and topic-guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate its strong generalization and universality, outperforming the state-of-the-art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at https://github.com/NKU-LITI/MAPEX.
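The dual-path dispatch at the heart of MAPEX is easy to picture: route short documents to the knowledge-driven path and long ones to the topic-guided path. A sketch with stubbed-out agents follows (the length threshold and agent interfaces are assumptions for illustration):

```python
def extract_keyphrases(document, agents, length_threshold=512):
    """Dispatch between MAPEX-style extraction paths by document
    length (threshold and agent interfaces are hypothetical)."""
    candidates = agents["extractor"](document)
    if len(document.split()) <= length_threshold:
        context = agents["knowledge"](document, candidates)  # short text
    else:
        context = agents["topic"](document, candidates)      # long text
    return agents["post_processor"](candidates, context)

stub_agents = {
    "extractor": lambda d: d.lower().split(),
    "knowledge": lambda d, c: {"hints": c[:3]},
    "topic": lambda d, c: {"topics": c[:3]},
    "post_processor": lambda c, ctx: sorted(set(c))[:5],
}
print(extract_keyphrases("keyphrase extraction with multi agent pipelines",
                         stub_agents))
```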
[140] Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus
Chiara Alzetta, Serena Auriemma, Alessandro Bondielli, Luca Dini, Chiara Fazzone, Alessio Miaschi, Martina Miliani, Marta Sartor
Main category: cs.CL
TL;DR: Analysis of 10 years of Italian CL/NLP research trends through CLiC-it conference proceedings (2014-2024), tracking evolution from lexical resources to LLMs and multimodality.
Details
Motivation: To understand how Computational Linguistics and Natural Language Processing research in Italy has evolved over the past decade, particularly with the rise of Transformer-based LLMs, and to track shifting research priorities and trends.Method: Compiled proceedings from 10 editions of CLiC-it conference (2014-2024) into a corpus, analyzing both metadata (author provenance, gender, affiliations) and paper content to identify research trends and developments.
Result: Identified a significant shift in research focus from Lexical and Semantic Resources to Language Modelling and Multimodality, reflecting broader field transformations driven by LLM advancements.
Conclusion: Provides valuable insights into Italian CL/NLP community’s evolution, supporting informed decisions and future research directions by documenting key developments and emerging trends over a decade.
Abstract: Over the past decade, Computational Linguistics (CL) and Natural Language Processing (NLP) have evolved rapidly, especially with the advent of Transformer-based Large Language Models (LLMs). This shift has transformed research goals and priorities, from Lexical and Semantic Resources to Language Modelling and Multimodality. In this study, we track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it, arguably the leading Italian conference in the field. We compile the proceedings from the first 10 editions of the CLiC-it conference (from 2014 to 2024) into the CLiC-it Corpus, providing a comprehensive analysis of both its metadata (author provenance, gender, affiliations, and more) and the content of the papers themselves, which address various topics. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time, supporting informed decisions and future directions in the field.
[141] Soft Tokens, Hard Truths
Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier
Main category: cs.CL
TL;DR: This paper introduces a scalable reinforcement learning method to train continuous Chain-of-Thought tokens without distillation, enabling efficient learning of diverse reasoning paths with minimal computational overhead.
Details
Motivation: Continuous tokens offer greater expressivity and efficiency for reasoning tasks, but practical use has been limited by training difficulties and computational costs in previous approaches.Method: Uses ‘soft’ tokens (mixtures of tokens with noise on input embeddings) via reinforcement learning to learn continuous CoTs with hundreds of tokens, avoiding distillation from discrete CoTs.
Result: On math reasoning benchmarks with Llama and Qwen models up to 8B, continuous CoTs match discrete CoTs for pass@1 and surpass them for pass@32, showing greater diversity. Best performance comes from training with continuous CoTs then using discrete tokens for inference.
Conclusion: Continuous CoT RL training provides a scalable approach that preserves base model predictions on out-of-domain tasks while enabling more diverse reasoning paths.
Abstract: The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use “soft” tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the “soft” models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
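The "soft" token itself is simple to state: a probability-weighted mixture over the embedding table, plus noise on the input embedding for RL exploration. A minimal numpy sketch (the noise scale is a hypothetical choice):

```python
import numpy as np

def soft_token(logits, embedding_matrix, noise_std=0.1, rng=None):
    """Mixture-of-embeddings 'soft' token with Gaussian input noise
    (hypothetical sketch of the mechanism)."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax over the vocabulary
    mixture = probs @ embedding_matrix      # continuous input embedding
    return mixture + rng.normal(0.0, noise_std, size=mixture.shape)

vocab_size, dim = 8, 4
E = np.random.default_rng(1).normal(size=(vocab_size, dim))
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0, 0.3, -0.5, 1.2])
print(soft_token(logits, E))  # fed back as the next CoT step's input
```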
[142] Online Process Reward Learning for Agentic Reinforcement Learning
Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao
Main category: cs.CL
TL;DR: OPRL introduces an online process reward learning method that transforms trajectory preferences into step-level rewards for better credit assignment in agentic RL, achieving state-of-the-art performance with improved sample efficiency.
Details
Motivation: Sparse and unverifiable rewards in LLM-based agent training make temporal credit assignment challenging, and existing process supervision methods suffer from biased annotation, reward hacking, and high variance.Method: OPRL optimizes an implicit process reward model alternately with the agent’s policy using trajectory-based DPO to convert trajectory preferences into step rewards, which are combined with outcome rewards for policy updates.
Result: OPRL achieves superior performance over frontier LLMs and strong RL baselines across WebShop, VisualSokoban, and SOTOPIA benchmarks, with state-of-the-art results, higher sample efficiency, and lower variance.
Conclusion: OPRL provides an effective credit-assignment strategy for agentic RL that enables efficient exploration and demonstrates strong potential for real-world agent learning scenarios.
Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high variance from overly fine-grained signals, or failures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent’s policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy updates, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample efficiency and lower variance during training. Further analysis also demonstrates the efficient exploration by OPRL using fewer actions, underscoring its potential for agentic learning in real-world scenarios.
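The credit-assignment step can be sketched as mixing step-level advantages (derived from implicit step rewards) with an episode-level advantage from the outcome reward; the baselines and mixing weight below are hypothetical simplifications of OPRL's objective:

```python
import numpy as np

def combined_advantages(step_rewards, outcome_reward, gamma=1.0, beta=0.5):
    """Blend step-level advantages from implicit PRM rewards with a
    single episode-level advantage (hypothetical simplification)."""
    step_rewards = np.asarray(step_rewards, dtype=float)
    # Discounted return-to-go per step, centered by a mean baseline.
    returns = np.array([sum(gamma**k * r
                            for k, r in enumerate(step_rewards[t:]))
                        for t in range(len(step_rewards))])
    step_adv = returns - returns.mean()
    episode_adv = outcome_reward - 0.5   # hypothetical scalar baseline
    return beta * step_adv + (1.0 - beta) * episode_adv

print(combined_advantages([0.1, -0.2, 0.4], outcome_reward=1.0))
```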
cs.CV
[143] Vision-Based Perception for Autonomous Vehicles in Off-Road Environment Using Deep Learning
Nelson Alves Ferreira Neto
Main category: cs.CV
TL;DR: Proposes CMSNet framework for real-time semantic segmentation of drivable regions in off-road environments, introduces Kamino dataset with 12,000 images, and achieves real-time performance through optimization with TensorRT/CUDA.
Details
Motivation: Need for low-latency intelligent systems for autonomous driving on non-uniform terrain like unpaved roads and open-pit mines, especially under adverse conditions (night, rain, dust).Method: Configurable Modular Segmentation Network (CMSNet) framework with different architectural arrangements, trained for obstacle and trafficable ground segmentation. Optimized for real-time inference by methodically removing/fusing CNN layers using TensorRT, C++, and CUDA.
Result: Validated effectiveness on two datasets, capable of navigating rough terrain without predefined trails. Kamino dataset provides extensive labeled data with high pixel-level annotations from eight synchronized cameras.
Conclusion: CMSNet enables real-time semantic segmentation for autonomous vehicles in challenging off-road environments, addressing visibility impairments and terrain variability through modular architecture and optimization techniques.
Abstract: Low-latency intelligent systems are required for autonomous driving on non-uniform terrain in open-pit mines and developing countries. This work proposes a perception system for autonomous vehicles on unpaved roads and off-road environments, capable of navigating rough terrain without a predefined trail. The Configurable Modular Segmentation Network (CMSNet) framework is proposed, facilitating different architectural arrangements. CMSNet configurations were trained to segment obstacles and trafficable ground on new images from unpaved/off-road scenarios with adverse conditions (night, rain, dust). We investigated applying deep learning to detect drivable regions without explicit track boundaries, studied algorithm behavior under visibility impairment, and evaluated field tests with real-time semantic segmentation. A new dataset, Kamino, is presented with almost 12,000 images from an operating vehicle with eight synchronized cameras. The Kamino dataset has a high number of labeled pixels compared to similar public collections and includes images from an off-road proving ground emulating a mine under adverse visibility. To achieve real-time inference, CMSNet CNN layers were methodically removed and fused using TensorRT, C++, and CUDA. Empirical experiments on two datasets validated the proposed system’s effectiveness.
[144] Overview of LifeCLEF Plant Identification task 2020
Herve Goeau, Pierre Bonnet, Alexis Joly
Main category: cs.CV
TL;DR: The LifeCLEF 2020 Plant Identification challenge aimed to improve automated plant identification in biodiversity-rich but data-deficient tropical regions by leveraging digitized herbarium collections alongside limited field photos.
Details
Motivation: Current automated plant identification systems are biased toward well-documented regions (North America, Western Europe) while biodiversity-rich tropical areas lack sufficient training data. However, centuries of herbarium collections from these regions provide a valuable alternative data source.Method: The challenge used a cross-domain classification approach where training combined hundreds of thousands of herbarium sheets with thousands of field photos to learn mapping between domains. The test set consisted exclusively of field photos from South America’s Guiana Shield region (about 1,000 species).
Result: The evaluation assessed how well automated systems could identify plants in data-deficient tropical regions by leveraging herbarium collections as training data alongside limited field photos.
Conclusion: Herbarium collections can significantly enhance automated plant identification in biodiversity-rich but data-scarce tropical regions, demonstrating the value of historical botanical collections for modern AI applications in biodiversity informatics.
Abstract: Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data with more and more photos in the field. However, this profusion of data only concerns a few tens of thousands of species, mostly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have collected, catalogued and systematically stored plant specimens in herbaria, particularly in tropical regions, and the recent efforts by the biodiversity informatics community made it possible to put millions of digitized sheets online. The LifeCLEF 2020 Plant Identification challenge (or “PlantCLEF 2020”) was designed to evaluate to what extent automated identification on the flora of data deficient regions can be improved by the use of herbarium collections. It is based on a dataset of about 1,000 species mainly focused on South America’s Guiana Shield, an area known to have one of the greatest plant diversities in the world. The challenge was evaluated as a cross-domain classification task where the training set consists of several hundred thousand herbarium sheets and a few thousand photos to enable learning a mapping between the two domains. The test set was exclusively composed of photos taken in the field. This paper presents the resources and assessments of the conducted evaluation, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
[145] iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich
Main category: cs.CV
TL;DR: iFinder is a structured semantic grounding framework that translates dash-cam videos into hierarchical, interpretable data structures for LLMs, enabling better spatial reasoning and causal inference in driving video analysis.
Details
Motivation: Existing vision-language models struggle with spatial reasoning and explainability in dash-cam driving video analysis due to lack of domain-specific inductive biases and structured representations.Method: Modular, training-free pipeline using pretrained vision models to extract object pose, lane positions, and trajectories, organized hierarchically with a three-block prompting strategy for step-wise reasoning.
Result: Significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy.
Conclusion: iFinder provides a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding through domain-specific grounding.
Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues – object pose, lane positions, and object trajectories – which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM’s outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder’s proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
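The hierarchical structure iFinder produces is the key design idea; the record below sketches what such frame- and video-level organization might look like (field names are illustrative, not the paper's actual schema):

```python
import json

# Hypothetical sketch of an iFinder-style record handed to the LLM.
video_record = {
    "video_id": "dashcam_0001",
    "video_level": {
        "global_context": "two-lane road, light rain, dusk",
        "event_window": [42, 67],            # frames covering the incident
    },
    "frames": [
        {
            "frame_idx": 42,
            "lanes": {"ego_lane": "center", "lane_count": 2},
            "objects": [
                {"id": "car_3",
                 "pose_deg": 175.0,          # oriented toward the ego vehicle
                 "lane": "center",
                 "trajectory": [[12.1, 0.4], [10.8, 0.3]]},
            ],
        },
    ],
}

# Serialized and paired with the three-block prompting strategy, this
# structure grounds the LLM's spatial and causal reasoning.
print(json.dumps(video_record, indent=2))
```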
[146] CURE: Centroid-guided Unsupervised Representation Erasure for Facial Recognition Systems
Fnu Shivam, Nima Najafzadeh, Yenumula Reddy, Prashnna Gyawali
Main category: cs.CV
TL;DR: CURE is the first unsupervised unlearning framework for facial recognition systems that removes targeted samples without identity labels while preserving model performance, outperforming existing methods.
Details
Motivation: Facial recognition systems raise privacy concerns requiring data removal, but current unlearning techniques depend on supervised identity labels that are often unavailable in privacy-constrained or noisy datasets.Method: CURE (Centroid-guided Unsupervised Representation Erasure) uses an unsupervised approach without identity labels, and introduces the Unlearning Efficiency Score (UES) metric to balance forgetting and retention stability.
Result: CURE significantly outperforms unsupervised variants of existing unlearning methods and demonstrates effective quality-aware unlearning by removing low-quality images.
Conclusion: The framework successfully addresses the gap in unsupervised unlearning for facial recognition, showing the importance of image quality in machine unlearning processes.
Abstract: In the current digital era, facial recognition systems offer significant utility and have been widely integrated into modern technological infrastructures; however, their widespread use has also raised serious privacy concerns, prompting regulations that mandate data removal upon request. Machine unlearning has emerged as a powerful solution to address this issue by selectively removing the influence of specific user data from trained models while preserving overall model performance. However, existing machine unlearning techniques largely depend on supervised techniques requiring identity labels, which are often unavailable in privacy-constrained situations or in large-scale, noisy datasets. To address this critical gap, we introduce CURE (Centroid-guided Unsupervised Representation Erasure), the first unsupervised unlearning framework for facial recognition systems that operates without the use of identity labels, effectively removing targeted samples while preserving overall performance. We also propose a novel metric, the Unlearning Efficiency Score (UES), which balances forgetting and retention stability, addressing shortcomings in the current evaluation metrics. CURE significantly outperforms unsupervised variants of existing unlearning methods. Additionally, we conducted quality-aware unlearning by designating low-quality images as the forget set, demonstrating its usability and benefits, and highlighting the role of image quality in machine unlearning.
[147] Synthesizing Artifact Dataset for Pixel-level Detection
Dennis Menn, Feng Liang, Diana Marculescu
Main category: cs.CV
TL;DR: Proposes an artifact corruption pipeline that automatically injects artifacts into clean synthetic images to generate pixel-level annotations, eliminating the need for expensive manual labeling.
Details
Motivation: Training artifact detectors requires expensive pixel-level human annotations, and existing pseudo-labeling approaches suffer from noisy labels that limit detector performance.Method: An artifact corruption pipeline that automatically injects artifacts into clean, high-quality synthetic images on predetermined regions to produce pixel-level annotations without manual labeling.
Result: Achieves performance improvements of 13.2% for ConvNeXt and 3.7% for Swin-T compared to baseline approaches, as verified on human-labeled data.
Conclusion: This work represents an initial step toward scalable pixel-level artifact annotation datasets that integrate world knowledge into artifact detection.
Abstract: Artifact detectors have been shown to enhance the performance of image-generative models by serving as reward models during fine-tuning. These detectors enable the generative model to improve overall output fidelity and aesthetics. However, training the artifact detector requires expensive pixel-level human annotations that specify the artifact regions. The lack of annotated data limits the performance of the artifact detector. A naive pseudo-labeling approach (training a weak detector and using it to annotate unlabeled images) suffers from noisy labels, resulting in poor performance. To address this, we propose an artifact corruption pipeline that automatically injects artifacts into clean, high-quality synthetic images on a predetermined region, thereby producing pixel-level annotations without manual labeling. The proposed method enables training of an artifact detector that achieves performance improvements of 13.2% for ConvNeXt and 3.7% for Swin-T, as verified on human-labeled data, compared to baseline approaches. This work represents an initial step toward scalable pixel-level artifact annotation datasets that integrate world knowledge into artifact detection.
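The pipeline's central trick is that injecting a corruption at a known location yields a pixel mask for free. A toy sketch (the corruption here is simple additive noise, a hypothetical stand-in for the paper's artifact models):

```python
import numpy as np

def corrupt_with_artifact(image, top, left, size, noise_std=0.3, seed=0):
    """Inject a synthetic artifact into a predetermined square region
    and return the corresponding pixel-level mask (hypothetical
    corruption model)."""
    rng = np.random.default_rng(seed)
    corrupted = image.copy()
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    region = (slice(top, top + size), slice(left, left + size))
    corrupted[region] = np.clip(
        corrupted[region] + rng.normal(0, noise_std, corrupted[region].shape),
        0.0, 1.0)
    mask[region] = 1               # pixel annotation, no manual labeling
    return corrupted, mask

clean = np.random.default_rng(1).random((64, 64, 3))
img, mask = corrupt_with_artifact(clean, top=20, left=24, size=16)
print(mask.sum(), "annotated artifact pixels")  # 256
```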
[148] Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation
Neeraj Gangwar, Anshuka Rangi, Rishabh Deshmukh, Holakou Rahmanian, Yesh Dattatreya, Nickvash Kani
Main category: cs.CV
TL;DR: A novel parameter-efficient multi-task learning approach that uses progressive task-specific adapter modules, with shared adapters in early layers and task-specific adapters in later layers, plus gradient-based task similarity for optimal adapter allocation.
Details
Motivation: To address task interference and negative transfer in parameter-efficient fine-tuning for multi-task learning, which are exacerbated by limited trainable parameters when extending single-task methods to multi-task scenarios.Method: Introduces progressive task-specific adapter modules in pre-trained models - shared across tasks in initial layers, becoming more task-specific in later layers. Uses gradient-based task similarity computation to allocate similar tasks to shared adapter modules.
Result: Outperforms fully fine-tuned multi-task models while using only one-fifth of trainable parameters. Achieves better relative improvement to single-task fine-tuning and surpasses state-of-the-art parameter-efficient multi-task learning methods on PASCAL and NYUD-v2 datasets.
Conclusion: The progressive task-specific adaptation approach effectively reduces task conflicts by enabling transfer learning in early layers and task-specific learning in later layers, providing superior parameter efficiency and performance in multi-task learning.
Abstract: Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning exacerbates common challenges, such as task interference and negative transfer, due to the limited number of trainable parameters. To address these issues, we introduce progressive task-specific multi-task adaptation, a novel parameter-efficient approach for multi-task learning. This approach introduces adapter modules in a pre-trained model such that these modules are shared across all tasks in the initial layers and become progressively more task-specific in the later layers. The motivation is to reduce the conflicts among tasks by allowing transfer learning across all tasks in the initial layers and enabling task-specific learning toward the prediction heads. Additionally, we propose a gradient-based approach for computing task similarity and use this measure to allocate similar tasks to the shared adapter modules. Our task similarity method introduces minimal overhead in the pipeline. We evaluate our approach by adapting the Swin Transformer for dense prediction tasks. Experiments on the PASCAL and NYUD-v2 datasets demonstrate that our approach outperforms a fully fine-tuned multi-task model while requiring only one-fifth of the trainable parameters. This approach achieves better relative improvement to single-task fine-tuning while reducing the number of trainable parameters and surpasses the current state-of-the-art methods for parameter-efficient multi-task learning.
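A common way to realize gradient-based task similarity for the adapter-allocation step is cosine similarity between flattened per-task gradients (the paper's exact formulation may differ):

```python
import numpy as np

def task_similarity(gradients):
    """Pairwise cosine similarity between flattened per-task gradient
    vectors (one standard reading of gradient-based task similarity)."""
    G = np.stack([g / (np.linalg.norm(g) + 1e-12) for g in gradients])
    return G @ G.T

rng = np.random.default_rng(0)
g_seg, g_depth = rng.normal(size=100), rng.normal(size=100)
g_normals = g_depth + 0.1 * rng.normal(size=100)  # correlated with depth
print(np.round(task_similarity([g_seg, g_depth, g_normals]), 2))
# High-similarity tasks (depth, normals) would share adapter modules.
```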
[149] Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG
Mahmoud Afifi, Ran Zhang, Michael S. Brown
Main category: cs.CV
TL;DR: RawJPEG Adapter is a learnable, invertible preprocessing pipeline that adapts raw images for standard JPEG compression, enabling accurate raw reconstruction while maintaining high compression efficiency.
Details
Motivation: Raw data preserves full sensor information but requires large storage (e.g., DNG format), while JPEG offers high compression but is unsuitable for raw storage. There's a need for a solution that combines JPEG's efficiency with raw data preservation.Method: A lightweight, learnable, and invertible preprocessing pipeline that applies spatial and optional frequency-domain transforms to raw images before JPEG compression. Compact parameters are stored in the JPEG comment field to enable accurate reconstruction.
Result: Experiments show higher fidelity than direct JPEG storage, support for other codecs, and favorable trade-off between compression ratio and reconstruction accuracy across multiple datasets.
Conclusion: RawJPEG Adapter provides an effective solution for storing raw-like information in standard JPEG format, offering practical benefits for constrained scenarios while maintaining reconstruction accuracy.
Abstract: Digital cameras digitize scene light into linear raw representations, which the image signal processor (ISP) converts into display-ready outputs. While raw data preserves full sensor information, which is valuable for editing and vision tasks, formats such as Digital Negative (DNG) require large storage, making them impractical in constrained scenarios. In contrast, JPEG is a widely supported format, offering high compression efficiency and broad compatibility, but it is not well-suited for raw storage. This paper presents RawJPEG Adapter, a lightweight, learnable, and invertible preprocessing pipeline that adapts raw images for standard JPEG compression. Our method applies spatial and optional frequency-domain transforms, with compact parameters stored in the JPEG comment field, enabling accurate raw reconstruction. Experiments across multiple datasets show that our method achieves higher fidelity than direct JPEG storage, supports other codecs, and provides a favorable trade-off between compression ratio and reconstruction accuracy.
[150] CLOSP: A Unified Semantic Space for SAR, MSI, and Text in Remote Sensing
Daniele Rege Cambrin, Lorenzo Vaiani, Giuseppe Gallipoli, Luca Cagliero, Paolo Garza
Main category: cs.CV
TL;DR: CrisisLandMark introduces a large-scale corpus of 647,000+ Sentinel-1 SAR and Sentinel-2 multispectral images with structured text annotations, and CLOSP framework that uses text to align optical and SAR images into unified embedding space, achieving 54% improvement in retrieval performance.
Details
Motivation: Most text-to-image retrieval systems are limited to RGB data and fail to exploit unique physical information from other sensors like SAR and multispectral data, which is crucial for applications like disaster response and climate monitoring.Method: CLOSP (Contrastive Language Optical SAR Pretraining) uses text as a bridge to align unpaired optical and SAR images into a unified embedding space. GeoCLOSP extends this by integrating geographic coordinates for location-dependent tasks.
Result: CLOSP achieves state-of-the-art performance with a 54% improvement in retrieval nDCG@1000. The unified training transfers semantic knowledge from the optical to the SAR domain, and GeoCLOSP creates a trade-off between general semantic tasks and specialized location-dependent retrieval.
Conclusion: Integration of diverse sensor data and geographic context is essential for unlocking the full potential of remote sensing archives, enabling better crisis event retrieval and rare geographic feature identification.
Abstract: Retrieving relevant imagery from vast satellite archives is crucial for applications like disaster response and long-term climate monitoring. However, most text-to-image retrieval systems are limited to RGB data, failing to exploit the unique physical information captured by other sensors, such as the all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the spectral signatures in optical multispectral data. To bridge this gap, we introduce CrisisLandMark, a new large-scale corpus of over 647,000 Sentinel-1 SAR and Sentinel-2 multispectral images paired with structured textual annotations for land cover, land use, and crisis events harmonized from authoritative land cover systems (CORINE and Dynamic World) and crisis-specific sources. We then present CLOSP (Contrastive Language Optical SAR Pretraining), a novel framework that uses text as a bridge to align unpaired optical and SAR images into a unified embedding space. Our experiments show that CLOSP achieves a new state-of-the-art, improving retrieval nDCG@1000 by 54% over existing models. Additionally, we find that the unified training strategy overcomes the inherent difficulty of interpreting SAR imagery by transferring rich semantic knowledge from the optical domain through indirect interaction. Furthermore, GeoCLOSP, which integrates geographic coordinates into our framework, creates a powerful trade-off between generality and specificity: while CLOSP excels at general semantic tasks, GeoCLOSP becomes a specialized expert for retrieving location-dependent crisis events and rare geographic features. This work highlights that the integration of diverse sensor data and geographic context is essential for unlocking the full potential of remote sensing archives.
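Because optical and SAR images are unpaired, the objective pulls each imaging modality toward its text caption rather than toward the other modality. A hedged sketch of such a text-bridged contrastive loss (a plain symmetric InfoNCE; CLOSP's actual objective may differ in detail):

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over two L2-normalized embedding batches."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    diag = np.arange(len(a))
    log_sm_ab = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    log_sm_ba = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    return -(log_sm_ab[diag, diag].mean() + log_sm_ba[diag, diag].mean()) / 2

rng = np.random.default_rng(0)
text_emb, opt_emb, sar_emb = (rng.normal(size=(4, 32)) for _ in range(3))
# Text is the bridge: neither modality is contrasted against the other.
loss = info_nce(text_emb, opt_emb) + info_nce(text_emb, sar_emb)
print(float(loss))
```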
[151] The Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar
William L. Muckelroy III, Mohammed Alsakabi, John M. Dolan, Ozan K. Tonguz
Main category: cs.CV
TL;DR: This paper investigates how higher-capacity segmentation backbones affect the quality of LiDAR-like 3D point clouds generated from 4D Radars, finding that an optimal backbone can provide 23.7% improvement over state-of-the-art methods.
Details
Motivation: LiDAR's high cost limits adoption in autonomous driving systems, so researchers aim to create LiDAR-like point clouds using cheaper 4D Radars instead. Previous work showed progress but the impact of segmentation backbone capacity on point cloud quality wasn't fully explored.Method: The study builds on prior neural network approaches that use LiDAR point clouds as ground truth to train models that generate LiDAR-like 3D point clouds from 4D Radars. The research specifically investigates the effect of different segmentation backbone capacities on the quality of the produced point clouds.
Result: Results show that while very high-capacity models can actually hurt performance, there exists an optimal segmentation backbone capacity that provides a 23.7% improvement over the current state-of-the-art methods.
Conclusion: Careful selection of segmentation backbone capacity is crucial for generating high-quality LiDAR-like point clouds from 4D Radars, with an optimal capacity providing significant performance gains while avoiding the negative effects of excessive model complexity.
Abstract: LiDAR’s dense, sharp point cloud (PC) representations of the surrounding environment enable accurate perception and significantly improve road safety by offering greater scene awareness and understanding. However, LiDAR’s high cost continues to restrict the broad adoption of high-level Autonomous Driving (AD) systems in commercially available vehicles. Prior research has shown progress towards circumventing the need for LiDAR by training a neural network, using LiDAR point clouds as ground truth (GT), to produce LiDAR-like 3D point clouds using only 4D Radars. One of the best examples is a neural network created to train a more efficient radar target detector with a modular 2D convolutional neural network (CNN) backbone and a temporal coherence network at its core that uses the RaDelft dataset for training (see arXiv:2406.04723). In this work, we investigate the impact of higher-capacity segmentation backbones on the quality of the produced point clouds. Our results show that while very high-capacity models may actually hurt performance, an optimal segmentation backbone can provide a 23.7% improvement over the state-of-the-art (SOTA).
[152] Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment
Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza
Main category: cs.CV
TL;DR: Researchers created a news-image benchmark to evaluate how large vision-language models (VLMs) reproduce harmful social stereotypes when visual cues like age, gender, race, clothing, or occupation are present.
Details
Motivation: Large VLMs can jointly interpret images and text but are prone to absorbing and reproducing harmful social stereotypes from visual cues, creating risks that need systematic investigation.Method: Developed a benchmark with 1,343 image-question pairs from diverse news outlets, annotated with ground-truth answers and demographic attributes. Evaluated state-of-the-art VLMs using an LLM as judge with human verification.
Result: Visual context systematically shifts model outputs in open-ended settings; bias prevalence varies across attributes and models (particularly high risk for gender and occupation); higher faithfulness doesn’t necessarily correspond to lower bias.
Conclusion: The study reveals systematic bias risks in VLMs and releases benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.
Abstract: Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.
[153] GaussianSeal: Rooting Adaptive Watermarks for 3D Gaussian Generation Model
Runyi Li, Xuanyu Zhang, Chuhan Tong, Zhipei Xu, Jian Zhang
Main category: cs.CV
TL;DR: GaussianSeal is the first bit watermarking framework for 3D Gaussian Splatting generative models, enabling copyright protection by embedding and decoding watermarks from rendered 3D outputs with high accuracy and minimal impact on quality.
Details
Motivation: With AIGC technologies advancing into 3D object generation, there's a gap in copyright protection for 3DGS models compared to existing watermarking methods for images and text.Method: Incorporates adaptive bit modulation modules into the generative model’s network blocks, embedding watermarks during generation to enable bit decoding from rendered outputs.
Result: Outperforms post-processing watermarking approaches, achieving superior watermark decoding accuracy while preserving the quality of generated 3D objects.
Conclusion: GaussianSeal provides an effective copyright protection solution for 3DGS generative models with high precision and minimal training overhead.
Abstract: With the advancement of AIGC technologies, the modalities generated by models have expanded from images and videos to 3D objects, leading to an increasing number of works focused on 3D Gaussian Splatting (3DGS) generative models. Existing research on copyright protection for generative models has primarily concentrated on watermarking in image and text modalities, with little exploration into the copyright protection of 3D object generative models. In this paper, we propose the first bit watermarking framework for 3DGS generative models, named GaussianSeal, to enable the decoding of bits as copyright identifiers from the rendered outputs of generated 3DGS. By incorporating adaptive bit modulation modules into the generative model and embedding them into the network blocks in an adaptive way, we achieve high-precision bit decoding with minimal training overhead while maintaining the fidelity of the model’s outputs. Experiments demonstrate that our method outperforms post-processing watermarking approaches for 3DGS objects, achieving superior watermark decoding accuracy while preserving the quality of the generated results.
[154] MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning
Zeyu He, Shuai Huang, Yuwu Lu, Ming Zhao
Main category: cs.CV
TL;DR: This paper proposes MoTiC, a novel framework for Few-Shot Class-Incremental Learning (FSCIL) that addresses prototype estimation bias through Bayesian prior alignment, contrastive learning, and momentum self-supervision to improve feature tightness and reduce variance.
Details
Motivation: Existing FSCIL methods suffer from significant estimation bias in new-class prototypes due to extreme data scarcity, while base-class prototypes benefit from sufficient data. This imbalance leads to suboptimal performance in incremental learning scenarios.Method: The proposed MoTiC framework uses Bayesian analysis to align new-class priors with old-class statistics, large-scale contrastive learning for cross-category feature tightness, and integrates momentum self-supervision with virtual categories to enrich feature diversity and inject prior information.
Result: Experiments on three FSCIL benchmarks show state-of-the-art performance, particularly on the fine-grained CUB-200 task, demonstrating the method’s ability to reduce estimation bias and improve incremental learning robustness.
Conclusion: MoTiC effectively addresses the dual challenge of FSCIL by reducing prototype estimation variance through Bayesian alignment and enhancing feature space cohesion, leading to superior performance in few-shot incremental learning scenarios.
Abstract: Few-Shot Class-Incremental Learning (FSCIL) must contend with the dual challenge of learning new classes from scarce samples while preserving old class knowledge. Existing methods use a frozen feature extractor and class-averaged prototypes to mitigate catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias due to extreme data scarcity, whereas base-class prototypes benefit from sufficient data. In this work, we theoretically demonstrate that aligning the new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. Furthermore, we propose large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information for new-class prototypes, we integrate momentum self-supervision and virtual categories into the Momentum Tightness and Contrast framework (MoTiC), constructing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks produce state-of-the-art performance, particularly on the fine-grained task CUB-200, validating our method’s ability to reduce estimation bias and improve incremental learning robustness.
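The Bayesian alignment idea can be made concrete with a conjugate-Gaussian-style shrinkage estimate: pull the noisy few-shot mean toward a prior built from base-class statistics, which reduces variance. The estimator below is a standard illustration, not necessarily the paper's exact formula:

```python
import numpy as np

def shrunk_prototype(few_shot_feats, prior_mean, prior_strength=10.0):
    """Shrink a few-shot class mean toward a base-class prior:
    prototype = (n * sample_mean + k * prior_mean) / (n + k),
    a conjugate-Gaussian style estimate (hypothetical illustration)."""
    n = len(few_shot_feats)
    sample_mean = np.mean(few_shot_feats, axis=0)
    k = prior_strength
    return (n * sample_mean + k * prior_mean) / (n + k)

rng = np.random.default_rng(0)
base_prior = np.zeros(8)                  # e.g., mean of related base classes
shots = rng.normal(loc=1.0, size=(5, 8))  # a 5-shot new class
print(shrunk_prototype(shots, base_prior))  # pulled toward the prior
```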
[155] Deep Learning for Clouds and Cloud Shadow Segmentation in Methane Satellite and Airborne Imaging Spectroscopy
Manuel Perez-Carrasco, Maya Nasr, Sebastien Roche, Chris Chan Miller, Zhan Zhang, Core Francisco Park, Eleanor Walker, Cecilia Garraffo, Douglas Finkbeiner, Ritesh Gautam, Steven Wofsy
Main category: cs.CV
TL;DR: Machine learning methods for cloud and cloud shadow detection in hyperspectral remote sensing data, comparing conventional techniques with deep learning architectures to improve methane emission quantification.
Details
Motivation: Effective cloud and cloud shadow detection is critical for accurate atmospheric methane retrieval in hyperspectral remote sensing, especially for MethaneSAT and MethaneAIR missions, as clouds bias methane retrievals and impact emission quantification.Method: Evaluated conventional methods (Iterative Logistic Regression and Multilayer Perceptron) against deep learning architectures (UNet and Spectral Channel Attention Network) for cloud/shadow detection in high-resolution hyperspectral data.
Result: Deep learning models substantially outperformed conventional methods: UNet excelled at preserving spatial structure, while SCAN captured fine boundary details better and surpassed UNet on MethaneSAT data due to spectral attention mechanisms.
Conclusion: Advanced deep learning architectures provide robust, scalable solutions for cloud/shadow screening, enhancing methane emission quantification capacity for current and next-generation hyperspectral missions.
Abstract: Effective cloud and cloud shadow detection is a critical prerequisite for accurate retrieval of concentrations of atmospheric methane or other trace gases in hyperspectral remote sensing. This challenge is especially pertinent for MethaneSAT and for its airborne companion mission, MethaneAIR. In this study, we use machine learning methods to address the cloud and cloud shadow detection problem for these high-spatial-resolution instruments. Cloud and cloud shadows in remote sensing data need to be effectively screened out, as they bias methane retrievals in remote sensing imagery and impact the quantification of emissions. We deploy and evaluate conventional techniques, including Iterative Logistic Regression (ILR) and Multilayer Perceptron (MLP), alongside advanced deep learning architectures, namely UNet and a Spectral Channel Attention Network (SCAN). Our results show that conventional methods struggle with spatial coherence and boundary definition, affecting the detection of clouds and cloud shadows. Deep learning models substantially improve detection quality: UNet performs best in preserving spatial structure, while SCAN excels at capturing fine boundary details. Notably, SCAN surpasses UNet on MethaneSAT data, underscoring the benefits of incorporating spectral attention for satellite-specific features. This in-depth assessment of disparate machine learning techniques demonstrates the strengths and effectiveness of advanced deep learning architectures in providing robust, scalable solutions for cloud and cloud shadow screening, towards enhancing the methane emission quantification capacity of existing and next-generation hyperspectral missions. Our data and code are publicly available at https://doi.org/10.7910/DVN/IKLZOJ
[156] Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis
Luigi Sigillo, Shengfeng He, Danilo Comminiello
Main category: cs.CV
TL;DR: LWD is a lightweight training framework that improves detail and texture fidelity in ultra-high-resolution image synthesis using frequency-aware masking and wavelet energy maps, requiring no architectural changes or inference overhead.
Details
Motivation: High-resolution image synthesis faces challenges in balancing computational efficiency with preserving fine-grained visual detail, particularly at 2K-4K resolutions.Method: Introduces a frequency-aware masking strategy based on wavelet energy maps to dynamically focus training on detail-rich regions, combined with a scale-consistent VAE objective for spectral fidelity.
Result: LWD consistently improves perceptual quality and FID scores across multiple baselines without adding inference costs or requiring architectural modifications.
Conclusion: Signal-driven supervision through wavelet-based approaches provides a principled and efficient path for high-resolution generative modeling.
Abstract: High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling.
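The frequency-aware mask can be illustrated with a one-level Haar transform: sum the energy of the three detail subbands and keep the most energetic locations. The sketch below (keep_ratio and the single-level transform are hypothetical simplifications) shows the idea on a toy image:

```python
import numpy as np

def wavelet_energy_mask(img, keep_ratio=0.25):
    """Binary mask of detail-rich locations from one-level Haar
    detail-subband energy (hypothetical simplification of LWD's
    wavelet energy maps)."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    lh = (a + b - c - d) / 2      # horizontal detail
    hl = (a - b + c - d) / 2      # vertical detail
    hh = (a - b - c + d) / 2      # diagonal detail
    energy = lh**2 + hl**2 + hh**2
    thresh = np.quantile(energy, 1.0 - keep_ratio)
    return (energy >= thresh).astype(np.float32)  # 1 = focus training here

img = np.random.default_rng(0).random((64, 64))
print(wavelet_energy_mask(img).mean())  # about keep_ratio of locations
```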
[157] Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies
Sumit Mamtani
Main category: cs.CV
TL;DR: Proposes two lightweight optimization techniques (STA and ANF) to reduce structured noise artifacts in Vision Transformers and improve feature map interpretability for downstream tasks.
Details
Motivation: Vision Transformers suffer from structured noise artifacts in feature maps that hinder downstream applications like segmentation and depth estimation.Method: Structured Token Augmentation (STA) enhances token diversity through spatial perturbations during tokenization, and Adaptive Noise Filtering (ANF) applies learnable inline denoising between transformer layers.
Result: Experimental results across ImageNet, Ade20k, and NYUv2 benchmarks show consistent improvements in visual quality and task performance.
Conclusion: The proposed methods are architecture-agnostic and practically effective for improving Vision Transformer interpretability and mitigating noise artifacts.
Abstract: Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artifacts in their feature maps hinder downstream applications such as segmentation and depth estimation. We propose two novel and lightweight optimization techniques, Structured Token Augmentation (STA) and Adaptive Noise Filtering (ANF), to improve interpretability and mitigate these artifacts. STA enhances token diversity through spatial perturbations during tokenization, while ANF applies learnable inline denoising between transformer layers. These methods are architecture-agnostic and evaluated across standard benchmarks, including ImageNet, Ade20k, and NYUv2. Experimental results show consistent improvements in visual quality and task performance, highlighting the practical effectiveness of our approach.
[158] From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
Ling Lo, Kelvin C. K. Chan, Wen-Huang Cheng, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: A method for smooth attribute transitions in video generation using frame-wise guidance during denoising, with a new benchmark and metrics for evaluation.
Details
Motivation: Existing models struggle with gradual attribute transitions in videos, often producing inconsistencies when using prompt interpolation methods.Method: Proposes frame-wise guidance during denoising that constructs transitional directions for each noisy latent, enabling gradual attribute shifts while preserving motion dynamics.
Result: The approach outperforms existing baselines, achieving visual fidelity, text alignment, and seamless attribute transitions.
Conclusion: The method effectively handles gradual attribute transitions in video generation and is supported by a new benchmark (CAT-Bench) and evaluation metrics.
Abstract: Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions, by introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions. Code and CAT-Bench are released: https://github.com/lynn-ling-lo/Prompt2Progression.
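The frame-wise guidance can be pictured as interpolating a transitional direction across frames and nudging each noisy latent along it during denoising; the sketch below is a deliberate simplification (the paper constructs a data-specific direction per latent):

```python
import numpy as np

def frame_wise_guidance(latents, dir_start, dir_end, scale=1.0):
    """Shift each frame's noisy latent along a direction that
    interpolates from the initial to the final attribute
    (hypothetical simplification of the denoising-time guidance)."""
    T = len(latents)
    guided = []
    for t, z in enumerate(latents):
        alpha = t / max(T - 1, 1)                  # 0 -> 1 across frames
        direction = (1 - alpha) * dir_start + alpha * dir_end
        guided.append(z + scale * direction)
    return np.stack(guided)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))     # 8 frames, toy 16-d latents
d0, d1 = rng.normal(size=16), rng.normal(size=16)
print(frame_wise_guidance(latents, d0, d1).shape)  # (8, 16)
```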
[159] Anatomically Constrained Transformers for Cardiac Amyloidosis Classification
Alexander Thorley, Agis Chartsias, Jordan Strom, Roberto Lang, Jeremy Slivnick, Jamie O’Driscoll, Rajan Sharma, Dipak Kotecha, Jinming Duan, Alberto Gomez
Main category: cs.CV
TL;DR: A transformer model constrained to myocardium regions improves cardiac amyloidosis classification by focusing on clinically relevant anatomical features rather than full video analysis.
Details
Motivation: Current neural network approaches for cardiac amyloidosis detection use full video clips but lack assurance that classification is based on clinically relevant features known to be associated with the disease.
Method: The authors constrain a transformer model to the myocardium region where CA abnormalities occur, representing it as deforming points and corresponding image patches. They also apply this anatomical constraint to masked autoencoder pre-training by masking and reconstructing only anatomical patches.
Result: The anatomically constrained model achieves increased performance on CA classification compared to full video transformers, with explicit guarantee that classification focuses only on anatomical regions.
Conclusion: Constraining both the transformer and pre-training task to the myocardium where CA imaging features are localized improves classification performance while providing interpretability through attention visualization over the deforming myocardium.
Abstract: Cardiac amyloidosis (CA) is a rare cardiomyopathy, with typical abnormalities in clinical measurements from echocardiograms such as reduced global longitudinal strain of the myocardium. An alternative approach for detecting CA is via neural networks, using video classification models such as convolutional neural networks. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. An alternative paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur – the myocardium, which we embed as a set of deforming points and corresponding sampled image patches into input tokens. We show that our anatomical constraint can also be applied to the popular self-supervised learning masked autoencoder pre-training, where we propose to mask and reconstruct only anatomical patches. We show that by constraining both the transformer and pre-training task to the myocardium where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused on only anatomical regions of the echo, and enables us to visualize transformer attention scores over the deforming myocardium.
[160] Learning to Stop: Reinforcement Learning for Efficient Patient-Level Echocardiographic Classification
Woo-Jin Cho Kim, Jorge Oliveira, Arian Beqiri, Alex Thorley, Jordan Strom, Jamie O’Driscoll, Rajan Sharma, Jeremy Slivnick, Roberto Lang, Alberto Gomez, Agisilaos Chartsias
Main category: cs.CV
TL;DR: Proposes a reinforcement learning method to select optimal subsets of echocardiogram clips for disease classification, achieving better performance using only 30% of the clips than using all of them.
Details
Motivation: Current automated methods for echocardiogram analysis either use single clips (ignoring complementary information) or all clips (computationally expensive), creating a need for efficient clip selection.
Method: Uses reinforcement learning where an agent learns to process view-specific clips to reduce classification uncertainty or stop when confidence is sufficient, combined with learnable attention-based aggregation for fusing information.
Result: Achieved AUC of 0.91 for detecting cardiac amyloidosis using only 30% of all clips, outperforming methods using all clips and other benchmarks.
Conclusion: The proposed method provides an efficient way to select optimal clip subsets for echocardiogram analysis, reducing computational cost while maintaining or improving classification performance.
Abstract: Guidelines for transthoracic echocardiographic examination recommend the acquisition of multiple video clips from different views of the heart, resulting in a large number of clips. Typically, automated methods, for instance disease classifiers, either use one clip or average predictions from all clips. Relying on one clip ignores complementary information available from other clips, while using all clips is computationally expensive and may be prohibitive for clinical adoption. To select the optimal subset of clips that maximize performance for a specific task (image-based disease classification), we propose a method optimized through reinforcement learning. In our method, an agent learns to either keep processing view-specific clips to reduce the disease classification uncertainty, or stop processing if the achieved classification confidence is sufficient. Furthermore, we propose a learnable attention-based aggregation method as a flexible way of fusing information from multiple clips. The proposed method obtains an AUC of 0.91 on the task of detecting cardiac amyloidosis using only 30% of all clips, exceeding the performance achieved from using all clips and from other benchmarks.
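The inference loop implied by the learn-to-stop formulation is easy to sketch: process one view-specific clip at a time, refresh the aggregated prediction, and halt when the policy is satisfied. The encoder/aggregator/policy interfaces and the 0.9 threshold are illustrative assumptions, not the paper's trained components.

```python
import torch

@torch.no_grad()
def classify_patient(clips, encoder, aggregator, policy):
    """Sequentially process echo clips until the policy says stop.

    encoder: clip -> feature vector; aggregator: list of features ->
    class probabilities (attention-based fusion in the paper);
    policy: (probs, step) -> bool. All three are placeholders.
    """
    feats, probs = [], None
    for i, clip in enumerate(clips):
        feats.append(encoder(clip))
        probs = aggregator(feats)
        if policy(probs, i):          # confident enough: stop early
            break
    return probs

# a trivial stand-in policy: stop once the top class exceeds 0.9
stop_if_confident = lambda probs, step: probs.max().item() > 0.9
```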
[161] Towards Robust In-Context Learning for Medical Image Segmentation via Data Synthesis
Jiesi Hu, Yanwu Yang, Zhiyu Ye, Chenfei Ye, Hanyang Peng, Jianfeng Cao, Ting Ma
Main category: cs.CV
TL;DR: SynthICL is a novel data synthesis framework using domain randomization to address data scarcity in medical image segmentation for In-Context Learning (ICL), achieving significant performance improvements and enhanced generalization.
Details
Motivation: The rise of ICL for medical image segmentation creates unprecedented demand for large-scale diverse datasets, exacerbating data scarcity. Existing synthesis methods fail to achieve both high diversity and suitable domain distribution for medical data.
Method: Built upon domain randomization, SynthICL leverages anatomical priors from real datasets to ensure realism, generates diverse anatomical structures for broad data distribution, and explicitly models inter-subject variations to create suitable ICL data cohorts.
Result: Extensive experiments on four held-out datasets show models trained with SynthICL data achieve up to 63% average Dice improvement and substantially enhanced generalization to unseen anatomical domains.
Conclusion: SynthICL helps mitigate the data bottleneck for ICL-based segmentation, paving the way for robust models in medical image analysis.
Abstract: The rise of In-Context Learning (ICL) for universal medical image segmentation has introduced an unprecedented demand for large-scale, diverse datasets for training, exacerbating the long-standing problem of data scarcity. While data synthesis offers a promising solution, existing methods often fail to simultaneously achieve both high data diversity and a domain distribution suitable for medical data. To bridge this gap, we propose SynthICL, a novel data synthesis framework built upon domain randomization. SynthICL ensures realism by leveraging anatomical priors from real-world datasets, generates diverse anatomical structures to cover a broad data distribution, and explicitly models inter-subject variations to create data cohorts suitable for ICL. Extensive experiments on four held-out datasets validate our framework’s effectiveness, showing that models trained with our data achieve performance gains of up to 63% in average Dice and substantially enhanced generalization to unseen anatomical domains. Our work helps mitigate the data bottleneck for ICL-based segmentation, paving the way for robust models. Our code and the generated dataset are publicly available at https://github.com/jiesihu/Neuroverse3D.
[162] VIMD: Monocular Visual-Inertial Motion and Depth Estimation
Saimouli Katragadda, Guoquan Huang
Main category: cs.CV
TL;DR: VIMD is a monocular visual-inertial motion and depth learning framework that estimates dense metric depth by leveraging MSCKF-based motion tracking and iteratively refining per-pixel scale using multi-view information.
Details
Motivation: Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR applications, requiring practical solutions for resource-constrained settings.
Method: The framework exploits multi-view information to iteratively refine per-pixel scale (instead of globally fitting an invariant affine model) and is highly modular to work with various depth estimation backbones using MSCKF-based monocular visual-inertial motion tracking.
Result: VIMD achieves exceptional accuracy and robustness on TartanAir and VOID datasets, with strong zero-shot generalization on AR Table dataset, even with extremely sparse points (10-20 metric depth points per image).
Conclusion: VIMD is a practical solution for deployment in resource-constrained settings with robust performance and strong generalization capabilities across various scenarios.
Abstract: Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework to estimate dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. At its core, the proposed VIMD exploits multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization capabilities on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness, even with extremely sparse points, as few as 10-20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource-constrained settings, while its robust performance and strong generalization capabilities offer significant potential across a wide range of scenarios.
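The contrast between the prior global affine fit and VIMD's per-pixel refinement can be sketched in a few lines. Both functions below are simplified assumptions: the real system refines scale iteratively across multiple views with MSCKF poses, while this toy version fits against a single sparse depth map.

```python
import torch

def global_affine_fit(pred, sparse_depth, mask):
    """Prior-work baseline: one scale/shift for the whole image,
    least-squares fitted to the sparse metric points."""
    x, y = pred[mask], sparse_depth[mask]
    A = torch.stack([x, torch.ones_like(x)], dim=1)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution  # (2, 1)
    return pred * sol[0] + sol[1]

def per_pixel_scale_step(scale, pred, sparse_depth, mask, lr=0.1):
    """One gradient step on a dense per-pixel scale map, a simplified
    stand-in for VIMD's iterative multi-view refinement."""
    scale = scale.clone().requires_grad_(True)
    loss = ((scale * pred - sparse_depth)[mask] ** 2).mean()
    loss.backward()
    return (scale - lr * scale.grad).detach()
```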
[163] Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation
Bo Yu, Jianhua Yang, Zetao Du, Yan Huang, Chenglong Li, Liang Wang
Main category: cs.CV
TL;DR: FMISeg is a frequency-domain multi-modal interaction model that improves medical image segmentation by integrating clinical text reports with visual features in the frequency domain to enhance representation and suppress irrelevant information.
Details
Motivation: Existing methods struggle with complex lesion morphological changes and the semantic gap between vision-language modalities, leading to suboptimal segmentation performance in pulmonary infectious disease diagnosis.
Method: Proposes a late fusion model with Frequency-domain Feature Bidirectional Interaction (FFBI) module for effective frequency-domain feature fusion, and Language-guided Frequency-domain Feature Interaction (LFFI) module to suppress semantically irrelevant visual features using linguistic guidance.
Result: Experiments on QaTa-COV19 and MosMedData+ datasets show that FMISeg outperforms state-of-the-art methods both qualitatively and quantitatively.
Conclusion: The frequency-domain approach with multi-modal interaction effectively bridges the vision-language gap and improves segmentation accuracy for pulmonary infectious diseases.
Abstract: Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.
[164] PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction
Yufei Han, Bowen Tie, Heng Guo, Youwei Lyu, Si Li, Boxin Shi, Yunpeng Jia, Zhanyu Ma
Main category: cs.CV
TL;DR: PolGS integrates polarimetric constraints into 3D Gaussian Splatting to enable fast (10-minute) reconstruction of reflective surfaces by separating specular and diffuse components.
Details
Motivation: 3D Gaussian Splatting methods are fast for novel view rendering but lag behind implicit neural representations in reconstructing surfaces with complex reflective properties, which is crucial for real-time virtual reality applications.
Method: PolGS integrates polarimetric constraints into the 3DGS framework to effectively separate specular and diffuse reflection components, enhancing reconstruction quality for challenging reflective materials.
Result: Experimental results on synthetic and real-world datasets validate the effectiveness of PolGS for reflective surface reconstruction.
Conclusion: PolGS enables fast (10-minute) high-quality reconstruction of reflective surfaces by leveraging polarimetric constraints within the 3D Gaussian Splatting framework.
Abstract: Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly when recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a Polarimetric Gaussian Splatting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method.
[165] CAMILA: Context-Aware Masking for Image Editing with Language Alignment
Hyunseung Kim, Chiho Choi, Srikanth Malla, Sai Prahladh Padmanabhan, Saurabh Bagchi, Joon Hee Choi
Main category: cs.CV
TL;DR: CAMILA is a context-aware image editing method that validates instruction feasibility before applying edits, preventing nonsensical outputs from infeasible or contradictory text instructions.
Details
Motivation: Existing image editing models naively follow all user instructions, even infeasible or contradictory ones, leading to nonsensical outputs. There's a need for methods that can validate instruction coherence with image context.
Method: CAMILA uses context-aware masking to validate contextual coherence between instructions and images, ensuring only relevant edits are applied to designated regions while ignoring non-executable instructions.
Result: The method achieves better performance and higher semantic alignment than state-of-the-art models, effectively handling complex instruction challenges while preserving image integrity.
Conclusion: CAMILA demonstrates effectiveness in addressing the challenge of infeasible instructions in text-guided image editing, providing a more robust solution for practical applications.
Abstract: Text-guided image editing allows users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while ignoring non-executable instructions. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating the presence of infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.
[166] Robust RGB-T Tracking via Learnable Visual Fourier Prompt Fine-tuning and Modality Fusion Prompt Generation
Hongtao Yang, Bineng Zhong, Qihua Liang, Zhiruo Zhu, Yaozong Zheng, Ning Li
Main category: cs.CV
TL;DR: VFPTrack is an efficient RGB-Thermal tracking method that combines spatial and frequency-domain prompts using Fast Fourier Transform for better feature extraction and modality fusion.
Details
Motivation: Existing PEFT-based RGB-T tracking methods rely only on spatial domain information as prompts for feature extraction, overlooking the importance of frequency-domain information in prompt learning, which limits their performance.
Method: Uses symmetric feature extraction encoder with shared parameters, visual Fourier prompts, and Modality Fusion Prompt Generator. Combines spatial visual prompts with frequency-domain prompts from FFT, and generates fused modality prompts for bidirectional interaction between modalities.
Result: Extensive experiments on three popular RGB-T tracking benchmarks show outstanding performance.
Conclusion: The proposed VFPTrack method effectively leverages both spatial and frequency-domain information to achieve superior RGB-T tracking performance through comprehensive modality feature interaction.
Abstract: Recently, visual prompt tuning has been introduced to RGB-Thermal (RGB-T) tracking as a parameter-efficient finetuning (PEFT) method. However, these PEFT-based RGB-T tracking methods typically rely solely on spatial domain information as prompts for feature extraction. As a result, they often fail to achieve optimal performance by overlooking the crucial role of frequency-domain information in prompt learning. To address this issue, we propose an efficient Visual Fourier Prompt Tracking (named VFPTrack) method to learn modality-related prompts via Fast Fourier Transform (FFT). Our method consists of a symmetric feature extraction encoder with shared parameters, visual Fourier prompts, and a Modality Fusion Prompt Generator that generates bidirectional interaction prompts through multi-modal feature fusion. Specifically, we first use a frozen feature extraction encoder to extract RGB and thermal infrared (TIR) modality features. Then, we combine the visual prompts in the spatial domain with the frequency-domain prompts obtained from the FFT, which allows for the full extraction and understanding of modality features from different domain information. Finally, unlike previous fusion methods, the modality fusion prompt generation module we use combines features from different modalities to generate a fused modality prompt. This modality prompt interacts with each individual modality to fully enable feature interaction across different modalities. Extensive experiments conducted on three popular RGB-T tracking benchmarks show that our method demonstrates outstanding performance.
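As an illustration of mixing spatial and frequency-domain prompts, the sketch below adds a learnable prompt to the tokens directly and another to their spectrum via a 1D FFT along the token dimension; the exact placement, FFT dimensions, and prompt shapes in VFPTrack are not specified in the summary, so treat this as an assumption.

```python
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    """Hypothetical combined spatial + frequency-domain prompt."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.spatial = nn.Parameter(torch.zeros(num_tokens, dim))
        # complex frequency prompt stored as real and imaginary parts
        self.freq_re = nn.Parameter(torch.zeros(num_tokens, dim))
        self.freq_im = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C); FFT along the token dimension
        spec = torch.fft.fft(tokens, dim=1)
        spec = spec + torch.complex(self.freq_re, self.freq_im)
        freq_branch = torch.fft.ifft(spec, dim=1).real
        return tokens + self.spatial + freq_branch
```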
[167] Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation
Xinhao Zhong, Shuoyang Sun, Xulin Gu, Chenyang Zhu, Bin Chen, Yaowei Wang
Main category: cs.CV
TL;DR: RD³ proposes a standardized evaluation protocol for decoupled dataset distillation methods, revealing that performance differences in existing methods stem from inconsistent evaluation procedures rather than methodological advances.
Details
Motivation: Existing decoupled dataset distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress and fair comparisons in the field.
Method: RD³ systematically investigates how different post-evaluation settings affect test accuracy and examines whether reported performance differences reflect true methodological advances or evaluation discrepancies.
Result: Analysis reveals that much performance variation is attributed to inconsistent evaluation rather than differences in synthetic data quality. General strategies that improve distilled datasets across settings are identified.
Conclusion: RD³ establishes a standardized benchmark and rigorous evaluation protocol to provide a foundation for fair and reproducible comparisons in future dataset distillation research.
Abstract: Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead. To address this limitation, recent decoupled dataset distillation methods (e.g., SRe²L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose Rectified Decoupled Dataset Distillation (RD³), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD³ provides a foundation for fair and reproducible comparisons in future dataset distillation research.
[168] nnFilterMatch: A Unified Semi-Supervised Learning Framework with Uncertainty-Aware Pseudo-Label Filtering for Efficient Medical Segmentation
Yi Yang
Main category: cs.CV
TL;DR: A novel semi-supervised learning framework (nnFilterMatch) that integrates SSL with entropy-based pseudo-label filtering within nnU-Net, eliminating retraining loops while achieving performance comparable to fully supervised models with only 5-20% labeled data.
Details
Motivation: Conventional SSL-AL hybrid approaches require iterative retraining cycles after each annotation round, causing significant computational overhead and limiting scalability in clinical applications.
Method: Integrates SSL with entropy-based pseudo-label filtering (FilterMatch) within the single-pass nnU-Net training framework, selectively excluding high-confidence pseudo-labels during training to avoid retraining loops while preserving uncertainty-guided learning benefits.
Result: Validated across multiple clinical segmentation benchmarks, achieving performance comparable to or exceeding fully supervised models with only 5-20% labeled data.
Conclusion: Introduces a scalable, end-to-end learning strategy for reducing annotation demands in medical image segmentation without compromising accuracy, with code publicly available.
Abstract: Semi-supervised learning (SSL) has emerged as a promising paradigm in medical image segmentation, offering competitive performance while substantially reducing the need for extensive manual annotation. When combined with active learning (AL), these strategies further minimize annotation burden by selectively incorporating the most informative samples. However, conventional SSL-AL hybrid approaches often rely on iterative and loop-based retraining cycles after each annotation round, incurring significant computational overhead and limiting scalability in clinical applications. In this study, we present a novel, annotation-efficient, and self-adaptive deep segmentation framework that integrates SSL with entropy-based pseudo-label filtering (FilterMatch), an AL-inspired mechanism, within the single-pass nnU-Net training segmentation framework (nnFilterMatch). By selectively excluding high-confidence pseudo-labels during training, our method circumvents the need for retraining loops while preserving the benefits of uncertainty-guided learning. We validate the proposed framework across multiple clinical segmentation benchmarks and demonstrate that it achieves performance comparable to or exceeding fully supervised models, even with only 5%–20% labeled data. This work introduces a scalable, end-to-end learning strategy for reducing annotation demands in medical image segmentation without compromising accuracy. Code is available here: https://github.com/Ordi117/nnFilterMatch.git.
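A compact sketch of entropy-based pseudo-label filtering for a 3D student-teacher segmentation setup appears below. Following the summary's wording, voxels whose predictive entropy falls below a threshold (high-confidence pseudo-labels) are excluded from the unsupervised loss; the threshold value and the exact loss form are our assumptions.

```python
import torch
import torch.nn.functional as F

def entropy_mask(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Per-voxel predictive entropy mask.

    logits: (B, K, D, H, W) -> bool mask (B, D, H, W) keeping voxels
    whose entropy is at least tau (low-entropy voxels are excluded,
    per the summary; tau is an assumed hyperparameter).
    """
    p = F.softmax(logits, dim=1)
    h = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)
    return h >= tau

def filtered_pseudo_loss(student_logits, teacher_logits, tau=0.5):
    # pseudo-labels come from the teacher; masked voxels carry no loss
    mask = entropy_mask(teacher_logits, tau)
    pseudo = teacher_logits.argmax(dim=1)
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp_min(1)
```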
[169] Talking Head Generation via AU-Guided Landmark Prediction
Shao-Yu Chang, Jingyi Xu, Hieu Le, Dimitris Samaras
Main category: cs.CV
TL;DR: A two-stage framework for audio-driven talking head generation with fine-grained expression control using explicit facial Action Units (AUs) mapped to 2D facial landmarks.
Details
Motivation: To achieve more precise and physically grounded expression control compared to prior methods that rely on emotion labels or implicit AU conditioning.
Method: First stage: variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. Second stage: diffusion-based synthesizer generates realistic videos conditioned on landmarks and reference image.
Result: Outperforms state-of-the-art baselines on MEAD dataset across multiple metrics, showing improved expression accuracy, temporal stability, and visual realism.
Conclusion: Explicit AU-to-landmark modeling is effective for expressive talking head generation, with the two-stage separation of motion and appearance yielding superior performance.
Abstract: We propose a two-stage framework for audio-driven talking head generation with fine-grained expression control via facial Action Units (AUs). Unlike prior methods relying on emotion labels or implicit AU conditioning, our model explicitly maps AUs to 2D facial landmarks, enabling physically grounded, per-frame expression control. In the first stage, a variational motion generator predicts temporally coherent landmark sequences from audio and AU intensities. In the second stage, a diffusion-based synthesizer generates realistic, lip-synced videos conditioned on these landmarks and a reference image. This separation of motion and appearance improves expression accuracy, temporal stability, and visual realism. Experiments on the MEAD dataset show that our method outperforms state-of-the-art baselines across multiple metrics, demonstrating the effectiveness of explicit AU-to-landmark modeling for expressive talking head generation.
[170] ExpFace: Exponential Angular Margin Loss for Deep Face Recognition
Jinhui Zheng, Xueyuan Gong
Main category: cs.CV
TL;DR: ExpFace is a new margin-based softmax loss that introduces an angular exponential term to better handle noisy samples in face recognition by applying larger penalties to center-region clean samples and smaller penalties to peripheral noisy samples.
Details
Motivation: Existing margin-based softmax losses (SphereFace, CosFace, ArcFace) enhance intra-class compactness and inter-class separability but overlook the impact of noisy samples, which tend to shift toward peripheral regions in angular space while clean samples cluster in the center.
Method: Proposes Exponential Angular Margin Loss (ExpFace) with an angular exponential term as margin, applying larger penalty in center region and smaller penalty in peripheral region to emphasize clean samples while suppressing noisy samples.
Result: Extensive experiments show ExpFace achieves state-of-the-art performance, avoids training instability of SphereFace and non-monotonicity of ArcFace, and exhibits similarity curves that align with decision boundaries in angular space.
Conclusion: ExpFace provides superior handling of noisy samples while maintaining strong discriminative power, with source code released for future research.
Abstract: Face recognition is an open-set problem requiring high discriminative power to ensure that intra-class distances remain smaller than inter-class distances. Margin-based softmax losses, such as SphereFace, CosFace, and ArcFace, have been widely adopted to enhance intra-class compactness and inter-class separability, yet they overlook the impact of noisy samples. By examining the distribution of samples in the angular space, we observe that clean samples predominantly cluster in the center region, whereas noisy samples tend to shift toward the peripheral region. Motivated by this observation, we propose the Exponential Angular Margin Loss (ExpFace), which introduces an angular exponential term as the margin. This design applies a larger penalty in the center region and a smaller penalty in the peripheral region within the angular space, thereby emphasizing clean samples while suppressing noisy samples. We present a unified analysis of ExpFace and classical margin-based softmax losses in terms of margin embedding forms, similarity curves, and gradient curves, showing that ExpFace not only avoids the training instability of SphereFace and the non-monotonicity of ArcFace, but also exhibits a similarity curve that applies penalties in the same manner as the decision boundary in the angular space. Extensive experiments demonstrate that ExpFace achieves state-of-the-art performance. To facilitate future research, we have released the source code at: https://github.com/dfr-code/ExpFace.
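The abstract specifies an angular exponential margin that penalizes the center region (small angles, clean samples) more than the periphery (large angles, noisy samples), but not its exact formula. The sketch below plugs an assumed decaying margin m * exp(-theta / t) into a standard margin-softmax skeleton; the concrete functional form is our guess, not the published loss.

```python
import torch
import torch.nn.functional as F

def exp_angular_margin_loss(embeddings, weight, labels, s=64.0, m=0.35, t=0.5):
    """Margin-softmax skeleton with an angle-dependent exponential margin.

    embeddings: (B, d); weight: (K, d) class centers. The margin
    m * exp(-theta / t) is largest near theta = 0 (clean, central
    samples) and decays toward the periphery (noisy samples).
    """
    cos = F.linear(F.normalize(embeddings), F.normalize(weight))
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    margin = m * torch.exp(-theta / t)          # assumed exponential form
    target = F.one_hot(labels, weight.shape[0]).bool()
    logits = torch.where(target, torch.cos(theta + margin), cos)
    return F.cross_entropy(s * logits, labels)
```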
[171] Logics-Parsing Technical Report
Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, Minggang Wu
Main category: cs.CV
TL;DR: Logics-Parsing is an end-to-end LVLM-based document parsing model enhanced with reinforcement learning that addresses layout analysis and reading order challenges in complex documents like multi-column newspapers.
Details
Motivation: Current LVLM-based document parsing methods lack explicit analytical stages for document layouts and reading orders, limiting their capability to handle complex document types such as multi-column newspapers or posters.
Method: Proposes Logics-Parsing: an end-to-end LVLM model augmented with reinforcement learning with carefully designed reward mechanisms for layout analysis and reading order inference. Also incorporates diverse data types (chemical formulas, handwritten Chinese characters) into supervised fine-tuning.
Result: Comprehensive experiments on LogicsParsingBench (1,078 page-level PDF images across 9 categories) validate the model’s efficacy and State-of-the-art performance across diverse document analysis scenarios.
Conclusion: The proposed Logics-Parsing model demonstrates superior performance in handling complex document types through reinforcement learning-enhanced layout analysis and reading order inference, with broad applicability across diverse document formats.
Abstract: Recent advances in Large Vision-Language models (LVLM) have spurred significant progress in document parsing task. Compared to traditional pipeline-based methods, end-to-end paradigms have shown their excellence in converting PDF images into structured outputs through integrated Optical Character Recognition (OCR), table recognition, mathematical formula recognition and so on. However, the absence of explicit analytical stages for document layouts and reading orders limits the LVLM’s capability in handling complex document types such as multi-column newspapers or posters. To address this limitation, we propose in this report Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model’s versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments conducted on LogicsParsingBench have validated the efficacy and State-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing
[172] Sex-based Bias Inherent in the Dice Similarity Coefficient: A Model Independent Analysis for Multiple Anatomical Structures
Hartmut Häntze, Myrthe Buser, Alessa Hering, Lisa C. Adams, Keno K. Bressem
Main category: cs.CV
TL;DR: The study reveals that the Dice Similarity Coefficient (DSC) metric itself introduces sex-based bias in medical image segmentation evaluation, as it penalizes errors more heavily in smaller structures, leading to systematically lower scores for women due to their smaller average organ volumes.
Details
Motivation: Previous work examined sex-based differences in models or datasets, but no study had investigated the potential bias introduced by the DSC metric itself, which penalizes segmentation errors more heavily in smaller structures, a concern given organ size differences between sexes.
Method: Applied equally-sized synthetic errors to manual MRI annotations from 50 participants to ensure sex-based comparability in an idealized setting independent of specific models, quantifying sex-based differences of both DSC and normalized DSC.
Result: Even minimal errors (e.g., 1 mm boundary shift) produced systematic DSC differences between sexes: average differences around 0.03 for small structures, 0.01 for medium-sized structures, with only large structures (lungs, liver) mostly unaffected.
Conclusion: Fairness studies using DSC should not expect identical scores between sexes, as the metric itself introduces bias. A model may perform equally well across sexes in error magnitude even if DSC values suggest otherwise. Recognizing this metric-induced bias is essential for fair evaluations.
Abstract: Overlap-based metrics such as the Dice Similarity Coefficient (DSC) penalize segmentation errors more heavily in smaller structures. As organ size differs by sex, this implies that a segmentation error of equal magnitude may result in lower DSCs in women due to their smaller average organ volumes compared to men. While previous work has examined sex-based differences in models or datasets, no study has yet investigated the potential bias introduced by the DSC itself. This study quantifies sex-based differences of the DSC and the normalized DSC in an idealized setting independent of specific models. We applied equally-sized synthetic errors to manual MRI annotations from 50 participants to ensure sex-based comparability. Even minimal errors (e.g., a 1 mm boundary shift) produced systematic DSC differences between sexes. For small structures, average DSC differences were around 0.03; for medium-sized structures around 0.01. Only large structures (i.e., lungs and liver) were mostly unaffected, with sex-based DSC differences close to zero. These findings underline that fairness studies using the DSC as an evaluation metric should not expect identical scores between men and women, as the metric itself introduces bias. A segmentation model may perform equally well across sexes in terms of error magnitude, even if observed DSC values suggest otherwise. Importantly, our work raises awareness of a previously underexplored source of sex-based differences in segmentation performance, one that arises not from model behavior but from the metric itself. Recognizing this factor is essential for more accurate and fair evaluations in medical image analysis.
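The metric-level effect is easy to verify numerically. Assuming spherical structures and a uniform 1 mm under-segmentation (a deliberate simplification; real organs are not spheres), the same boundary error costs far more Dice on a small structure than on a large one:

```python
from math import pi

def sphere_vol(r_mm: float) -> float:
    return 4.0 / 3.0 * pi * r_mm ** 3

def dice_after_shrink(r_mm: float, err_mm: float = 1.0) -> float:
    # prediction = truth eroded by err_mm, so it lies entirely inside truth
    v_true, v_pred = sphere_vol(r_mm), sphere_vol(r_mm - err_mm)
    return 2 * v_pred / (v_true + v_pred)

for r in (10, 25, 80):  # small / medium / large structure radius
    print(f"radius {r:3d} mm -> DSC {dice_after_shrink(r):.3f}")
# radius  10 mm -> DSC 0.843
# radius  25 mm -> DSC 0.939
# radius  80 mm -> DSC 0.981
```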
[173] EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction
Yu-Shen Huang, Tzu-Han Chen, Cheng-Yen Hsiao, Shaou-Gang Miaou
Main category: cs.CV
TL;DR: A lightweight Vision Transformer architecture for HDR imaging on edge devices that reduces computational costs by 67% while eliminating ghosting artifacts through novel fusion and optimization techniques.
Details
Motivation: High-quality HDR imaging is crucial for edge applications like surveillance and autonomous driving, but existing Multi-Exposure Fusion methods suffer from high computational costs and ghosting artifacts that limit deployment.
Method: Proposes a lightweight Vision Transformer based on Context-Aware Vision Transformer, using YCbCr color space conversion, Intersection-Aware Adaptive Fusion module to suppress ghosting, and optimization techniques including Inverted Residual Embedding, Dynamic Tanh, and Enhanced Multi-Scale Dilated Convolution.
Result: Main version reduces FLOPS by ~67% and increases inference speed by 5x on CPU and 2.5x on edge devices compared to baseline, while maintaining high visual quality.
Conclusion: The method provides an efficient, ghost-free HDR imaging solution for edge devices with excellent balance between performance and image quality across various dynamic scenarios.
Abstract: Achieving high-quality High Dynamic Range (HDR) imaging on resource-constrained edge devices is a critical challenge in computer vision, as its performance directly impacts downstream tasks such as intelligent surveillance and autonomous driving. Multi-Exposure Fusion (MEF) is a mainstream technique to achieve this goal; however, existing methods generally face the dual bottlenecks of high computational costs and ghosting artifacts, hindering their widespread deployment. To this end, this study proposes a light-weight Vision Transformer architecture designed explicitly for HDR reconstruction to overcome these limitations. This study is based on the Context-Aware Vision Transformer and begins by converting input images to the YCbCr color space to separate luminance and chrominance information. It then employs an Intersection-Aware Adaptive Fusion (IAAF) module to suppress ghosting effectively. To further achieve a light-weight design, we introduce Inverted Residual Embedding (IRE), Dynamic Tanh (DyT), and propose Enhanced Multi-Scale Dilated Convolution (E-MSDC) to reduce computational complexity at multiple levels. Our study ultimately contributes two model versions: a main version for high visual quality and a light-weight version with advantages in computational efficiency, both of which achieve an excellent balance between performance and image quality. Experimental results demonstrate that, compared to the baseline, the main version reduces FLOPS by approximately 67% and increases inference speed by more than fivefold on CPU and 2.5 times on an edge device. These results confirm that our method provides an efficient and ghost-free HDR imaging solution for edge devices, demonstrating versatility and practicality across various dynamic scenarios.
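Among the listed efficiency components, Dynamic Tanh (DyT) has a commonly cited formulation from recent normalization-free transformer work: an elementwise y = gamma * tanh(alpha * x) + beta with a learnable scalar alpha, used as a drop-in replacement for normalization layers. Whether EfficienT-HDR adopts exactly this variant is our assumption.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """DyT: y = gamma * tanh(alpha * x) + beta (one published variant;
    the paper's exact parameterization may differ)."""
    def __init__(self, dim: int, alpha0: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha0))   # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))        # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))        # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); intended as a normalization-layer replacement
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```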
[174] BiTAA: A Bi-Task Adversarial Attack for Object Detection and Depth Estimation via 3D Gaussian Splatting
Yixun Zhang, Feng Zhou, Jianqin Yin
Main category: cs.CV
TL;DR: BiTAA is a bi-task adversarial attack framework that simultaneously degrades object detection and biases monocular depth estimation using 3D Gaussian Splatting, with controllable depth manipulation and cross-task transfer analysis.
Details
Motivation: Existing 2D/3D adversarial attacks operate in task silos, lack mechanisms for controllable depth bias, and have no standardized protocol to quantify cross-task transfer between detection and depth estimation, leaving their interaction underexplored.
Method: A dual-model attack framework built on 3D Gaussian Splatting that supports full-image and patch attacks, compatible with common detectors and depth estimators. Uses a composite loss coupling detection suppression with signed, magnitude-controlled log-depth bias in ROIs, with optional EOT for physical realizability.
Result: The attack shows consistent cross-task degradation and reveals clear asymmetry in transfer between detection-to-depth and depth-to-detection tasks. Real-world evaluations demonstrate practical risks for multi-task camera-only perception.
Conclusion: BiTAA highlights security vulnerabilities in autonomous driving perception systems and motivates the need for cross-task-aware defenses to protect against simultaneous multi-task adversarial attacks.
Abstract: Camera-based perception is critical to autonomous driving yet remains vulnerable to task-specific adversarial manipulations in object detection and monocular depth estimation. Most existing 2D/3D attacks are developed in task silos, lack mechanisms to induce controllable depth bias, and offer no standardized protocol to quantify cross-task transfer, leaving the interaction between detection and depth underexplored. We present BiTAA, a bi-task adversarial attack built on 3D Gaussian Splatting that yields a single perturbation capable of simultaneously degrading detection and biasing monocular depth. Specifically, we introduce a dual-model attack framework that supports both full-image and patch settings and is compatible with common detectors and depth estimators, with optional expectation-over-transformation (EOT) for physical realizability. In addition, we design a composite loss that couples detection suppression with a signed, magnitude-controlled log-depth bias within regions of interest (ROIs), enabling controllable near or far misperception while maintaining stable optimization across tasks. We also propose a unified evaluation protocol with cross-task transfer metrics and real-world evaluations, showing consistent cross-task degradation and a clear asymmetry between Det-to-Depth and Depth-to-Det transfer. The results highlight practical risks for multi-task camera-only perception and motivate cross-task-aware defenses in autonomous driving scenarios.
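A skeletal version of the composite objective, coupling detection suppression with a signed, magnitude-controlled log-depth bias inside ROIs, might look as follows; the specific suppression term, weighting, and score definitions are assumptions for illustration, not the paper's loss.

```python
import torch

def bitaa_style_loss(det_scores, depth_pred, depth_clean, roi_mask,
                     sign=1.0, delta=0.3, lam=1.0):
    """Composite adversarial objective (simplified sketch).

    det_scores: detector confidences to suppress; depth tensors are
    positive metric depths. sign=+1 pushes ROI pixels to appear farther,
    sign=-1 nearer, with magnitude delta in log-depth space.
    """
    l_det = det_scores.mean()                          # drive scores down
    target = depth_clean.log() + sign * delta          # biased log-depth target
    l_depth = ((depth_pred.log() - target)[roi_mask] ** 2).mean()
    return l_det + lam * l_depth
```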
[175] StrCGAN: A Generative Framework for Stellar Image Restoration
Shantanusinh Parmar
Main category: cs.CV
TL;DR: StrCGAN is a generative model that enhances low-resolution astrophotography images using 3D convolutional layers, multi-spectral fusion, and astrophysical regularization to create high-fidelity celestial object representations.
Details
Motivation: Traditional CycleGAN models are limited to 2D mappings and often distort stellar morphology when enhancing low-resolution astrophotography from small telescopes like the MobilTelesco dataset.
Method: Extends CycleGAN with 3D convolutional layers for volumetric spatial correlations, multi-spectral fusion for optical-NIR domain alignment, and astrophysical regularization modules to preserve stellar morphology, guided by multi-mission all-sky survey ground truth.
Result: StrCGAN generates reconstructions that are visually sharper and physically consistent across spectral bands, outperforming standard GAN models in astrophysical image enhancement.
Conclusion: The proposed StrCGAN framework successfully overcomes limitations of traditional models by incorporating 3D spatial awareness and astrophysical constraints, producing superior quality celestial image enhancements.
Abstract: We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity ground truth-like representations of celestial objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but are restricted to 2D mappings and often distort the morphology of stars and galaxies. To overcome these limitations, we extend the CycleGAN framework with three key innovations: 3D convolutional layers to capture volumetric spatial correlations, multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate reconstructions that are not only visually sharper but also physically consistent, outperforming standard GAN models in the task of astrophysical image enhancement.
[176] Adaptive Model Ensemble for Continual Learning
Yuchuan Mao, Zhi Gao, Xiaomeng Fan, Yuwei Wu, Yunde Jia, Chenchen Jing
Main category: cs.CV
TL;DR: Meta-weight-ensembler is a meta-learning approach that adaptively generates layer-wise mixing coefficients for model ensemble in continual learning, addressing knowledge conflicts at both task and layer levels to alleviate catastrophic forgetting.
Details
Motivation: Existing model ensemble methods in continual learning suffer from knowledge conflict issues at task and layer levels, which compromise learning performance in both old and new tasks.
Method: Proposes a mixing coefficient generator trained via meta-learning to generate appropriate mixing coefficients for model ensemble. The coefficients are individually generated for each layer to address both task-level and layer-level knowledge conflicts.
Result: Experiments on multiple continual learning datasets show that meta-weight-ensembler effectively alleviates catastrophic forgetting and achieves state-of-the-art performance.
Conclusion: Meta-weight-ensembler can be flexibly combined with existing continual learning methods to boost their ability of alleviating catastrophic forgetting, providing an effective strategy for adaptive knowledge fusion in continual learning.
Abstract: Model ensemble is an effective strategy in continual learning, which alleviates catastrophic forgetting by interpolating model parameters and fusing knowledge learned from different tasks. However, existing model ensemble methods usually encounter the knowledge conflict issue at task and layer levels, causing compromised learning performance in both old and new tasks. To solve this issue, we propose meta-weight-ensembler, which adaptively fuses knowledge of different tasks for continual learning. Concretely, we employ a mixing coefficient generator trained via meta-learning to generate appropriate mixing coefficients for model ensemble to address the task-level knowledge conflict. The mixing coefficient is individually generated for each layer to address the layer-level knowledge conflict. In this way, we learn the prior knowledge about adaptively accumulating knowledge of different tasks in a fused model, achieving efficient learning in both old and new tasks. Meta-weight-ensembler can be flexibly combined with existing continual learning methods to boost their ability to alleviate catastrophic forgetting. Experiments on multiple continual learning datasets show that meta-weight-ensembler effectively alleviates catastrophic forgetting and achieves state-of-the-art performance.
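The ensemble step itself reduces to a per-layer convex combination of parameters; the sketch below applies given coefficients, whereas in the paper each alpha would come from the meta-learned generator.

```python
import torch

@torch.no_grad()
def layerwise_ensemble(old_state, new_state, coeffs):
    """Fuse two models' state dicts with layer-wise mixing coefficients.

    coeffs: parameter name -> alpha in [0, 1] (meta-learned in the
    paper, supplied externally in this sketch).
    """
    return {name: coeffs[name] * w_old + (1 - coeffs[name]) * new_state[name]
            for name, w_old in old_state.items()}

# usage sketch: model.load_state_dict(layerwise_ensemble(sd_old, sd_new, alphas))
```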
[177] ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection
Tai-Ming Huang, Wei-Tung Lin, Kai-Lung Hua, Wen-Huang Cheng, Junichi Yamagishi, Jun-Cheng Chen
Main category: cs.CV
TL;DR: ThinkFake is a reasoning-based framework using MLLM with forgery reasoning prompts and GRPO reinforcement learning for AI-generated image detection, achieving state-of-the-art performance with interpretable outputs.
Details
Motivation: Address limitations of existing AI-generated image detection methods that rely on binary classification without explanations or depend heavily on supervised fine-tuning, which leads to poor generalization.
Method: Uses Multimodal Large Language Model (MLLM) with forgery reasoning prompts, trained using Group Relative Policy Optimization (GRPO) reinforcement learning with structured reward functions for step-by-step reasoning.
Result: Outperforms state-of-the-art methods on GenImage benchmark and shows strong zero-shot generalization on LOKI benchmark.
Conclusion: ThinkFake provides an effective and robust framework for AI-generated image detection with interpretable reasoning capabilities and strong generalization performance.
Abstract: The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations, highlighting the urgent need for accurate and interpretable detection methods. While existing approaches have made progress, most rely on binary classification without explanations or depend heavily on supervised fine-tuning, resulting in limited generalization. In this paper, we propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. Our method leverages a Multimodal Large Language Model (MLLM) equipped with a forgery reasoning prompt and is trained using Group Relative Policy Optimization (GRPO) reinforcement learning with carefully designed reward functions. This design enables the model to perform step-by-step reasoning and produce interpretable, structured outputs. We further introduce a structured detection pipeline to enhance reasoning quality and adaptability. Extensive experiments show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark. These results validate our framework’s effectiveness and robustness. Code will be released upon acceptance.
[178] PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents
Filippo Ziliotto, Jelin Raphael Akkara, Alessandro Daniele, Lamberto Ballan, Luciano Serafini, Tommaso Campari
Main category: cs.CV
TL;DR: PersONAL benchmark introduces personalized object navigation and localization tasks for embodied AI agents, requiring them to find objects associated with specific users in photorealistic home environments.
Details
Motivation: Current embodied AI agents struggle with modeling individual human preferences and behaviors, making deployment in realistic human-centered scenarios like domestic households challenging.
Method: Created a comprehensive benchmark with 2,000+ episodes across 30+ HM3D homes, featuring natural-language queries that require reasoning about user-object associations. Supports two evaluation modes: active navigation in unseen environments and object grounding in mapped scenes.
Result: Experiments with state-of-the-art baselines show a substantial performance gap compared to human capabilities, demonstrating the difficulty of personalized reasoning tasks.
Conclusion: The benchmark highlights the need for embodied agents that can perceive, reason, and memorize personalized information, paving the way for real-world assistive robotics applications.
Abstract: Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization), a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as “find Lily’s backpack”. PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning, and memorizing over personalized information, paving the way towards real-world assistive robots.
[179] FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models
Xin Wang, Jie Li, Zejia Weng, Yixu Wang, Yifeng Gao, Tianyu Pang, Chao Du, Yan Teng, Yingchun Wang, Zuxuan Wu, Xingjun Ma, Yu-Gang Jiang
Main category: cs.CV
TL;DR: The paper identifies a critical vulnerability in Vision-Language-Action (VLA) models where adversarial images can ‘freeze’ the models, causing them to ignore subsequent instructions and potentially leading to dangerous inaction in robotics applications.
Details
Motivation: VLA models are advancing robotics but their safety and robustness against adversarial attacks remain largely unexplored. The researchers discovered that adversarial images can effectively disconnect a robot's decision-making from its physical actions, posing serious safety risks.
Method: The authors propose FreezeVLA, a novel attack framework using min-max bi-level optimization to generate and evaluate action-freezing attacks on VLA models.
Result: Experiments on three state-of-the-art VLA models and four robotic benchmarks show FreezeVLA achieves an average attack success rate of 76.2%, significantly outperforming existing methods. The adversarial images also exhibit strong transferability across diverse language prompts.
Conclusion: The findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms against such adversarial attacks.
Abstract: Vision-Language-Action (VLA) models are driving rapid progress in robotics by enabling agents to interpret multimodal inputs and execute complex, long-horizon tasks. However, their safety and robustness against adversarial attacks remain largely underexplored. In this work, we identify and formalize a critical adversarial vulnerability in which adversarial images can “freeze” VLA models and cause them to ignore subsequent instructions. This threat effectively disconnects the robot’s digital mind from its physical actions, potentially inducing inaction during critical interventions. To systematically study this vulnerability, we propose FreezeVLA, a novel attack framework that generates and evaluates action-freezing attacks via min-max bi-level optimization. Experiments on three state-of-the-art VLA models and four robotic benchmarks show that FreezeVLA attains an average attack success rate of 76.2%, significantly outperforming existing methods. Moreover, adversarial images generated by FreezeVLA exhibit strong transferability, with a single image reliably inducing paralysis across diverse language prompts. Our findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms.
[180] Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection
Yunqing Hu, Zheming Yang, Chang Zhao, Wen Ji
Main category: cs.CV
TL;DR: Proposes an adaptive edge-cloud collaborative object detection method using MLLMs to enhance detection in complex scenarios like low-light and occlusions, achieving significant latency and computational cost reductions while maintaining accuracy.
Details
Motivation: Traditional object detection methods degrade in complex scenarios due to lack of high-level semantic understanding, particularly in challenging conditions like low-light and heavy occlusions.
Method: Uses instruction fine-tuning on MLLMs to generate structured scene descriptions, implements adaptive mapping to convert semantic information into parameter adjustment signals for edge detectors, and employs edge-cloud collaborative inference with confidence-based switching between cloud guidance and edge detection.
Result: Reduces latency by over 79% and computational cost by 70% in low-light and highly occluded scenes while maintaining detection accuracy.
Conclusion: The proposed adaptive semantic enhancement method effectively balances accuracy and efficiency, demonstrating significant improvements in complex detection scenarios through edge-cloud collaboration and MLLM integration.
Abstract: Traditional object detection methods face performance degradation challenges in complex scenarios such as low-light conditions and heavy occlusions due to a lack of high-level semantic understanding. To address this, this paper proposes an adaptive guidance-based semantic enhancement edge-cloud collaborative object detection method leveraging Multimodal Large Language Models (MLLM), achieving an effective balance between accuracy and efficiency. Specifically, the method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts semantic information into parameter adjustment signals for edge detectors, achieving real-time semantic enhancement. Within an edge-cloud collaborative inference framework, the system automatically selects between invoking cloud-based semantic guidance or directly outputting edge detection results based on confidence scores. Experiments demonstrate that the proposed method effectively enhances detection accuracy and efficiency in complex scenes. Specifically, it can reduce latency by over 79% and computational cost by 70% in low-light and highly occluded scenes while maintaining accuracy.
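The confidence-based switching logic can be sketched in a few lines; the detector and MLLM interfaces and the keyword-to-parameter mapping below are hypothetical stand-ins for the paper's adaptive mapping mechanism.

```python
def params_from_scene(scene: str) -> dict:
    """Illustrative mapping from scene keywords to detector adjustments."""
    params = {}
    if "low-light" in scene:
        params["score_threshold"] = 0.25  # be more permissive in the dark
    if "occlusion" in scene:
        params["nms_iou"] = 0.6           # keep more overlapping candidates
    return params

def detect(frame, edge_detector, cloud_mllm, conf_threshold=0.5):
    boxes = edge_detector(frame)          # fast local inference on the edge
    if boxes and min(b.score for b in boxes) >= conf_threshold:
        return boxes                      # confident: skip the cloud entirely
    # Low confidence: request a structured scene description from the cloud
    # MLLM and adapt the edge detector before re-running it.
    scene = cloud_mllm.describe(frame)    # e.g. "low-light, heavy occlusion"
    edge_detector.adjust(params_from_scene(scene))
    return edge_detector(frame)
```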
[181] Generalized Shortest Path-based Superpixels for 3D Spherical Image Segmentation
Rémi Giraud, Rodrigo Borba Pinheiro, Yannick Berthoumieu
Main category: cs.CV
TL;DR: A new superpixel method called SphSPS is introduced for 360° spherical/omnidirectional images, which considers 3D spherical geometry to improve segmentation accuracy and shape regularity compared to planar methods.
Details
Motivation: Standard superpixel methods are designed for 2D planar images but fail to properly handle wide-angle 360° spherical images due to distortion and geometry differences. There's a need for dedicated approaches that respect the spherical acquisition space.Method: SphSPS generalizes the shortest path concept between pixels and superpixel centers by considering 3D spherical geometry. It computes relevant clustering features while respecting the acquisition space geometry, and introduces a new global regularity metric for spherical space.
Result: The method significantly outperforms both planar and spherical state-of-the-art approaches in segmentation accuracy, robustness to noise, and regularity on 360° spherical panorama datasets and synthetic road omnidirectional images.
Conclusion: SphSPS provides an effective tool for superpixel-based applications on 360° images by properly handling spherical geometry, addressing limitations of existing methods designed for planar images.
Abstract: The growing use of wide-angle image capture devices and the need for fast and accurate image analysis in computer vision have reinforced the need for dedicated under-representation approaches. Most recent decomposition methods segment an image into a small number of irregular homogeneous regions, called superpixels. Nevertheless, these approaches are generally designed to segment standard 2D planar images, i.e., captured with a 90° angle view without distortion. In this work, we introduce a new general superpixel method called SphSPS (Spherical Shortest Path-based Superpixels), dedicated to wide 360° spherical or omnidirectional images. Our method respects the geometry of the 3D spherical acquisition space and generalizes the notion of shortest path between a pixel and a superpixel center to quickly extract relevant clustering features. We demonstrate that considering the geometry of the acquisition space to compute the shortest path jointly improves the segmentation accuracy and the shape regularity of superpixels. To evaluate this regularity aspect, we also generalize a global regularity metric to the spherical space, addressing the limitations of the only existing spherical compactness measure. Finally, the proposed SphSPS method is validated on the reference 360° spherical panorama segmentation dataset and on synthetic road omnidirectional images. Our method significantly outperforms both planar and spherical state-of-the-art approaches in terms of segmentation accuracy, robustness to noise and regularity, providing a very useful tool for superpixel-based applications on 360° images.
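To make the geometric core concrete, here is a minimal sketch assuming equirectangular input: pixels are mapped to the unit sphere and compared with great-circle rather than planar distances. The mapping conventions below are illustrative and do not reproduce the authors' full shortest-path feature extraction.

```python
import numpy as np

def equirect_to_sphere(u, v, width, height):
    """Map pixel (u, v) of a WxH equirectangular image to a unit 3D vector."""
    lon = (u / width) * 2 * np.pi - np.pi        # longitude in [-pi, pi)
    lat = np.pi / 2 - (v / height) * np.pi       # latitude in [-pi/2, pi/2]
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def great_circle_distance(p, q):
    """Geodesic distance between two unit vectors on the sphere."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
```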
[182] Efficient Cell Painting Image Representation Learning via Cross-Well Aligned Masked Siamese Network
Pin-Jui Huang, Yu-Hsuan Liao, SooHeon Kim, NoSeong Park, JongBae Park, DongMyung Shin
Main category: cs.CV
TL;DR: CWA-MSN is a novel representation learning framework that learns batch-robust cell painting representations by aligning embeddings of cells with the same perturbation across different wells, achieving state-of-the-art performance with significantly fewer data and parameters.
Details
Motivation: To address the challenge of extracting biologically meaningful and batch-robust cell painting representations for drug discovery, as conventional self-supervised and contrastive learning methods require large-scale models and data while still struggling with batch effects.Method: Cross-Well Aligned Masked Siamese Network (CWA-MSN) - a representation learning framework that enforces semantic consistency by aligning embeddings of cells subjected to the same perturbation across different wells, integrated into a masked siamese architecture.
Result: CWA-MSN outperforms state-of-the-art methods (OpenPhenom and CellCLIP) by +29% and +9% respectively in gene-gene relationship retrieval, while using substantially fewer data (0.2M vs 2.2M images) and smaller model size (22M vs 1.48B parameters).
Conclusion: CWA-MSN is a simple and effective method for learning cell image representations that enables efficient phenotype modeling under limited data and parameter budgets.
Abstract: Computational models that predict cellular phenotypic responses to chemical and genetic perturbations can accelerate drug discovery by prioritizing therapeutic hypotheses and reducing costly wet-lab iteration. However, extracting biologically meaningful and batch-robust cell painting representations remains challenging. Conventional self-supervised and contrastive learning approaches often require a large-scale model and/or a huge amount of carefully curated data, still struggling with batch effects. We present Cross-Well Aligned Masked Siamese Network (CWA-MSN), a novel representation learning framework that aligns embeddings of cells subjected to the same perturbation across different wells, enforcing semantic consistency despite batch effects. Integrated into a masked siamese architecture, this alignment yields features that capture fine-grained morphology while remaining data- and parameter-efficient. For instance, in a gene-gene relationship retrieval benchmark, CWA-MSN outperforms the state-of-the-art publicly available self-supervised (OpenPhenom) and contrastive learning (CellCLIP) methods, improving the benchmark scores by +29% and +9%, respectively, while training on substantially fewer data (e.g., 0.2M images for CWA-MSN vs. 2.2M images for OpenPhenom) or smaller model size (e.g., 22M parameters for CWA-MSN vs. 1.48B parameters for CellCLIP). Extensive experiments demonstrate that CWA-MSN is a simple and effective way to learn cell image representation, enabling efficient phenotype modeling even under limited data and parameter budgets.
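A minimal sketch of the cross-well alignment objective, as read from the summary (the released training code may differ): embeddings of the same perturbation imaged in different wells are pulled together, which suppresses well and batch effects while keeping perturbation semantics.

```python
import torch
import torch.nn.functional as F

def cross_well_alignment_loss(emb_well_a, emb_well_b):
    """emb_well_a[i] and emb_well_b[i] come from the same perturbation,
    imaged in two different wells; maximize their cosine similarity."""
    za = F.normalize(emb_well_a, dim=-1)
    zb = F.normalize(emb_well_b, dim=-1)
    return (1.0 - (za * zb).sum(dim=-1)).mean()
```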
[183] Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering
Jiangxue Yu, Hui Wang, San Jiang, Xing Zhang, Dejin Zhang, Qingquan Li
Main category: cs.CV
TL;DR: A feature matching algorithm for aerial and ground images using intermediate view generation to bridge perspective distortions caused by viewpoint changes.
Details
Motivation: The integration of aerial and ground images for 3D modeling is challenging due to difficulty finding reliable correspondences between images with extensive viewpoint differences.Method: Uses incremental SfM on aerial images to create sparse models, then applies 3D Gaussian Splatting for scene rendering. Generates intermediate images to bridge aerial-ground gap, enabling reliable feature matching through render-aerial and render-ground image pairs.
Result: The method significantly increases both initial and refined matches compared to existing methods, enabling accurate ISfM reconstruction and complete 3DGS-based scene rendering.
Conclusion: The proposed solution effectively addresses the feature matching challenge between aerial and ground images through intermediate view generation, providing reliable matches for complex 3D scene modeling.
Abstract: The integration of aerial and ground images has been a promising solution in 3D modeling of complex scenes, which is seriously restricted by finding reliable correspondences. The primary contribution of this study is a feature matching algorithm for aerial and ground images, whose core idea is to generate intermediate views to alleviate perspective distortions caused by the extensive viewpoint changes. First, by using aerial images only, sparse models are reconstructed through an incremental SfM (Structure from Motion) engine due to their large scene coverage. Second, 3D Gaussian Splatting is then adopted for scene rendering by taking as inputs sparse points and oriented images. For accurate view rendering, a render viewpoint determination algorithm is designed by using the oriented camera poses of aerial images, which is used to generate high-quality intermediate images that can bridge the gap between aerial and ground images. Third, with the aid of intermediate images, reliable feature matching is conducted for match pairs from render-aerial and render-ground images, and final matches can be generated by transmitting correspondences through intermediate views. Using real aerial and ground datasets, the proposed solution has been validated in terms of feature matching and scene rendering and compared comprehensively with widely used methods. The experimental results demonstrate that the proposed solution can provide reliable feature matches for aerial and ground images with an obvious increase in the number of initial and refined matches, and it can provide enough matches to achieve accurate ISfM reconstruction and complete 3DGS-based scene rendering.
[184] CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation
Miren Samaniego, Igor Rodriguez, Elena Lazkano
Main category: cs.CV
TL;DR: CapStARE is a capsule-based spatio-temporal architecture for gaze estimation that combines ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders for efficient part-whole reasoning and temporal modeling, achieving state-of-the-art performance with real-time inference.
Details
Motivation: To develop a practical and robust solution for real-time gaze estimation in interactive systems by addressing the need for efficient part-whole reasoning and disentangled temporal modeling of gaze dynamics.Method: Uses a modular design with ConvNeXt backbone for feature extraction, capsule formation with attention routing for part-whole reasoning, and dual GRU decoders specialized for slow and rapid gaze dynamics to handle temporal modeling.
Result: Achieves state-of-the-art performance on ETH-XGaze (3.36), MPIIFaceGaze (2.65), Gaze360 (9.06), and RT-GENE (4.76) with real-time inference (<10ms), outperforming existing methods with fewer parameters and greater interpretability.
Conclusion: CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems, demonstrating superior performance across multiple datasets while maintaining efficiency and interpretability.
Abstract: We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found on: https://github.com/toukapy/capsStare
[185] GS-RoadPatching: Inpainting Gaussians via 3D Searching and Placing for Driving Scenes
Guo Chen, Jiarun Liu, Sicong Du, Chenming Wu, Deqi Li, Shi-Sheng Huang, Guofeng Zhang, Sheng Yang
Main category: cs.CV
TL;DR: GS-RoadPatching is a 3D Gaussian Splatting-based inpainting method for driving scenes that performs substitutional completion by matching similar structural patterns within the 3DGS feature space, eliminating the need for 2D diffusion models or retraining.
Details
Motivation: Existing 3DGS inpainting methods rely on 2D perspective-view-based diffusion/GAN models which require spatial-temporal consistency and time-intensive retraining. Driving scenes have highly repetitive patterns suitable for structural matching in 3D space.Method: Constructs feature-embedded 3DGS scenes with patch measurement for local context abstraction, uses structural search to find candidate patches in 3D space, and applies substitution-and-fusion optimization for visual harmony.
Result: Extensive experiments on multiple datasets show state-of-the-art performance in driving scenes, with superior quality and interoperability compared to baselines. Additional experiments demonstrate applicability in general scenes.
Conclusion: The method enables effective 3DGS-based substitutional inpainting directly through the 3DGS modality, providing an efficient alternative to 2D cross-modal approaches while achieving better visual results.
Abstract: This paper presents GS-RoadPatching, an inpainting method for driving scene completion by referring to completely reconstructed regions, which are represented by 3D Gaussian Splatting (3DGS). Unlike existing 3DGS inpainting methods that perform generative completion relying on 2D perspective-view-based diffusion or GAN models to predict limited appearance or depth cues for missing regions, our approach enables substitutional scene inpainting and editing directly through the 3DGS modality, extricating it from requiring spatial-temporal consistency of 2D cross-modals and eliminating the need for time-intensive retraining of Gaussians. Our key insight is that the highly repetitive patterns in driving scenes often share multi-modal similarities within the implicit 3DGS feature space and are particularly suitable for structural matching to enable effective 3DGS-based substitutional inpainting. Practically, we construct feature-embedded 3DGS scenes to incorporate a patch measurement method for abstracting local context at different scales and, subsequently, propose a structural search method to find candidate patches in 3D space effectively. Finally, we propose a simple yet effective substitution-and-fusion optimization for better visual harmony. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness and efficiency of our proposed method in driving scenes, and the results validate that our method achieves state-of-the-art performance compared to the baseline methods in terms of both quality and interoperability. Additional experiments in general scenes also demonstrate the applicability of the proposed 3D inpainting strategy. The project page and code are available at: https://shanzhaguoo.github.io/GS-RoadPatching/
[186] Interpreting ResNet-based CLIP via Neuron-Attention Decomposition
Edmund Bu, Yossi Gandelsman
Main category: cs.CV
TL;DR: A novel technique for interpreting CLIP-ResNet neurons by decomposing their contributions into individual computation paths through neuron-head pairs, enabling text association and applications in semantic segmentation and dataset monitoring.
Details
Motivation: To understand and interpret the internal representations of CLIP-ResNet by analyzing how individual neurons and attention heads contribute to the model's output through their computation paths.Method: Analyze all pairwise combinations of neurons and following attention heads in CLIP’s attention-pooling layer, approximate neuron-head pairs as single directions in the embedding space, and associate them with text for interpretation.
Result: Found that neuron-head pairs can be approximated by single directions, only sparse sets significantly contribute to output, some polysemantic pairs represent sub-concepts, and successfully applied to semantic segmentation and dataset distribution monitoring.
Conclusion: Examining individual computation paths reveals interpretable units in neural networks that can be effectively utilized for downstream tasks like semantic segmentation and dataset shift monitoring.
Abstract: We present a novel technique for interpreting the neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the following attention heads of CLIP’s attention-pooling layer. We find that these neuron-head pairs can be approximated by a single direction in CLIP-ResNet’s image-text embedding space. Leveraging this insight, we interpret each neuron-head pair by associating it with text. Additionally, we find that only a sparse set of the neuron-head pairs have a significant contribution to the output value, and that some neuron-head pairs, while polysemantic, represent sub-concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training-free semantic segmentation, outperforming previous methods for CLIP-ResNet. Second, we utilize the contributions of neuron-head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.
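Assuming a neuron-head pair's contribution has already been approximated by a single direction in the shared image-text space, associating it with text reduces to a nearest-text lookup, sketched below with precomputed CLIP text embeddings passed in as plain tensors.

```python
import torch
import torch.nn.functional as F

def label_direction(direction, text_embeddings, texts, top_k=5):
    """Return the k texts whose embeddings best align with the direction."""
    d = F.normalize(direction, dim=-1)
    t = F.normalize(text_embeddings, dim=-1)
    sims = t @ d                               # cosine similarity per text
    idx = sims.topk(top_k).indices
    return [(texts[i], sims[i].item()) for i in idx]
```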
[187] When Words Can’t Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset
Sarmistha Das, R E Zera Marveen Lyngkhoi, Kirtan Jain, Vinayak Goyal, Sriparna Saha, Manish Gupta
Main category: cs.CV
TL;DR: This paper introduces Complaint Description from Videos (CoD-V), a new task for generating expressive complaint descriptions from videos showing product defects, along with a dataset called ComVID and a multimodal RAG-enhanced VideoLLaMA2-7b model.
Details
Motivation: Users struggle to articulate complaints clearly in text but can easily upload videos showing product defects. Current explainable complaint mining approaches often leave issues unresolved due to poor textual expression.Method: The authors introduce ComVID dataset with 1,175 complaint videos and descriptions, propose a new complaint retention (CR) metric, and develop a multimodal Retrieval-Augmented Generation (RAG) embedded VideoLLaMA2-7b model that considers the user’s emotional state.
Result: The paper presents comprehensive evaluation of Video Language Models using metrics like METEOR, perplexity, and Coleman-Liau readability score, showing the effectiveness of their approach for complaint generation from videos.
Conclusion: This work establishes a foundation for enabling users to express complaints through video, providing a new research direction in complaint mining with practical applications for customer service and product feedback.
Abstract: While there exists a lot of work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as 'worst product' paired with a 5-second video of a headphone with a broken right earcup). This paper formulates a new task in the field of complaint mining, Complaint Description from Videos (CoD-V), to help everyday users write an expressive complaint (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos and the corresponding descriptions, also annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that discriminates the proposed CoD-V task against standard video summary generation and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG) embedded VideoLLaMA2-7b model, designed to generate complaints while accounting for the user’s emotional state. We conduct a comprehensive evaluation of several Video Language Models on several tasks (pre-trained and fine-tuned versions) with a range of established evaluation metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction to provide a platform for users to express complaints through video. Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.
[188] SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding
Phyo Thet Yee, Dimitrios Kollias, Sudeepta Mishra, Abhinav Dhall
Main category: cs.CV
TL;DR: SynchroRaMa is a novel framework for audio-driven talking face generation that integrates multi-modal emotion embedding from text and audio, and uses LLM-generated scene descriptions to enhance temporal consistency and visual realism.
Details
Motivation: Existing emotion-aware talking face generation methods rely on single modality emotion embedding and single reference images, limiting their ability to capture nuanced affective cues and represent dynamic changes across time.Method: SynchroRaMa combines emotional signals from text (sentiment analysis) and audio (speech emotion recognition, valence-arousal features), includes an audio-to-motion module for lip synchronization, and incorporates LLM-generated scene descriptions as additional textual input.
Result: Quantitative and qualitative experiments show SynchroRaMa outperforms state-of-the-art methods in image quality, expression preservation, and motion realism. User study confirms higher subjective ratings in naturalness, motion diversity, and video smoothness.
Conclusion: SynchroRaMa successfully addresses limitations of existing methods by integrating multi-modal emotion embedding and dynamic scene descriptions, achieving superior performance in generating expressive and natural talking face videos.
Abstract: Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model’s ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Our project page is available at https://novicemm.github.io/synchrorama.
[189] OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving
Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu, Ruoyu Yao, Shengbo Eben Li, Jun Ma
Main category: cs.CV
TL;DR: OmniScene is a novel human-like framework for autonomous driving that integrates multi-view and temporal perception with vision-language modeling for holistic 4D scene understanding, achieving state-of-the-art performance across multiple tasks.
Details
Motivation: Current autonomous driving systems lack true scene understanding capabilities, relying primarily on depth-based 3D reconstruction rather than human-like egocentric 3D scene comprehension that enables adaptive behaviors.Method: Proposes OmniScene with OmniVLM vision-language model for 4D scene understanding, uses teacher-student architecture with knowledge distillation to embed textual representations into 3D instance features, and introduces Hierarchical Fusion Strategy (HFS) for adaptive multimodal integration of geometric and semantic features.
Result: Comprehensive evaluation on nuScenes dataset shows superior performance against over ten state-of-the-art models across perception, prediction, planning, and visual question answering tasks, establishing new benchmarks.
Conclusion: The proposed human-like framework successfully bridges the gap between current autonomous driving systems and human-like scene understanding capabilities, demonstrating the effectiveness of integrating vision-language modeling with hierarchical multimodal fusion for holistic 4D scene comprehension.
Abstract: Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
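One plausible reading of the Hierarchical Fusion Strategy is a learnable gate that rebalances geometric and semantic features at each abstraction level; the sketch below follows that reading, with the gating form chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-level gate that adaptively mixes geometric and semantic features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, geom_feat, sem_feat):
        g = self.gate(torch.cat([geom_feat, sem_feat], dim=-1))
        return g * geom_feat + (1 - g) * sem_feat  # learned per-channel mix
```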
[190] CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion
Chenhao Ji, Chaohui Yu, Junyao Gao, Fan Wang, Cairong Zhao
Main category: cs.CV
TL;DR: CamPVG is the first diffusion-based framework for panoramic video generation guided by precise camera poses, addressing challenges in panoramic pose representation and spherical projection.
Details
Motivation: Existing methods focus on camera control in perspective projection video generation, but geometrically consistent panoramic video generation remains challenging due to complexities in panoramic pose representation and spherical projection.Method: Proposes panoramic Plücker embedding for camera position encoding through spherical coordinate transformation, and a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines for cross-view feature aggregation.
Result: Extensive experiments demonstrate that the method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
Conclusion: CamPVG effectively overcomes limitations of traditional methods for equirectangular projections and enables fine-grained cross-view feature aggregation, substantially enhancing panoramic video quality and consistency.
Abstract: Recently, camera-controlled video generation has seen rapid development, offering more precise control over video generation. However, existing methods predominantly focus on camera control in perspective projection video generation, while geometrically consistent panoramic video generation remains challenging. This limitation is primarily due to the inherent complexities in panoramic pose representation and spherical projection. To address this issue, we propose CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses. We achieve camera position encoding for panoramic images and cross-view feature aggregation based on spherical projection. Specifically, we propose a panoramic Plücker embedding that encodes camera extrinsic parameters through spherical coordinate transformation. This pose encoder effectively captures panoramic geometry, overcoming the limitations of traditional methods when applied to equirectangular projections. Additionally, we introduce a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines. This module enables fine-grained cross-view feature aggregation, substantially enhancing the quality and consistency of generated panoramic videos. Extensive experiments demonstrate that our method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
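For intuition, a Plücker embedding represents a ray with origin o and unit direction d by the pair (d, o x d); for a panorama, each equirectangular pixel yields a direction via spherical coordinates. The sketch below illustrates this construction only; the paper's encoder and coordinate conventions may differ.

```python
import numpy as np

def panoramic_plucker(width, height, cam_origin):
    """Per-pixel Plücker-style ray embedding for an equirectangular panorama."""
    us, vs = np.meshgrid(np.arange(width), np.arange(height))  # (H, W) grids
    lon = (us / width) * 2 * np.pi - np.pi
    lat = np.pi / 2 - (vs / height) * np.pi
    d = np.stack([np.cos(lat) * np.cos(lon),
                  np.cos(lat) * np.sin(lon),
                  np.sin(lat)], axis=-1)                       # unit ray dirs
    m = np.cross(np.broadcast_to(cam_origin, d.shape), d)      # moment o x d
    return np.concatenate([d, m], axis=-1)                     # (H, W, 6)
```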
[191] SDE-DET: A Precision Network for Shatian Pomelo Detection in Complex Orchard Environments
Yihao Hu, Pan Wang, Xiaodong Bai, Shijie Cai, Hang Wang, Huazhong Liu, Aiping Yang, Xiangxiang Li, Meiping Ding, Hongyan Liu, Jianguo Yao
Main category: cs.CV
TL;DR: This paper proposes SDE-DET, a novel detection model for Shatian pomelo in complex orchard environments, achieving state-of-the-art performance on the custom STP-AgriData dataset.
Details
Motivation: Pomelo detection is crucial for automated harvesting and maturity analysis, but faces challenges like multi-scale issues, obstructions from trunks/leaves, and small object detection in complex orchard environments.Method: SDE-DET uses Star Block for high-dimensional information acquisition, Deformable Attention for occlusion handling, and multiple Efficient Multi-Scale Attention mechanisms for small object detection while reducing computational overhead.
Result: SDE-DET achieved scores of 0.883 Precision, 0.771 Recall, 0.838 mAP@0.5, 0.497 mAP@0.5:0.95, and 0.823 F1-score, outperforming Yolo series and other mainstream detection models.
Conclusion: SDE-DET provides a reliable method for Shatian pomelo detection, laying the foundation for developing automatic harvest robots in agricultural applications.
Abstract: Pomelo detection is an essential process for their localization, automated robotic harvesting, and maturity analysis. However, detecting Shatian pomelo in complex orchard environments poses significant challenges, including multi-scale issues, obstructions from trunks and leaves, small object detection, etc. To address these issues, this study constructs a custom dataset STP-AgriData and proposes the SDE-DET model for Shatian pomelo detection. SDE-DET first utilizes the Star Block to effectively acquire high-dimensional information without increasing the computational overhead. Furthermore, the presented model adopts Deformable Attention in its backbone, to enhance its ability to detect pomelos under occluded conditions. Finally, multiple Efficient Multi-Scale Attention mechanisms are integrated into our model to reduce the computational overhead and extract deep visual representations, thereby improving the capacity for small object detection. In the experiment, we compared SDE-DET with the Yolo series and other mainstream detection models in Shatian pomelo detection. The presented SDE-DET model achieved scores of 0.883, 0.771, 0.838, 0.497, and 0.823 in Precision, Recall, mAP@0.5, mAP@0.5:0.95 and F1-score, respectively. SDE-DET has achieved state-of-the-art performance on the STP-AgriData dataset. Experiments indicate that the SDE-DET provides a reliable method for Shatian pomelo detection, laying the foundation for the further development of automatic harvest robots.
[192] Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models
Zhifang Zhang, Jiahan Zhang, Shengjie Zhou, Qi Wei, Shuo He, Feng Liu, Lei Feng
Main category: cs.CV
TL;DR: The paper proposes Proxy Targeted Attack (PTA), a novel method to improve targeted adversarial attacks on multimodal pre-trained models by addressing limitations in generalizability and undetectability.
Details
Motivation: Existing targeted adversarial attacks on multimodal pre-trained models have limitations in generalizability (limited effectiveness against partially known or semantically similar targets) and undetectability (easily detected by simple anomaly detection methods).Method: PTA leverages multiple source-modal and target-modal proxies to optimize targeted adversarial examples, ensuring they remain evasive to defenses while aligning with multiple potential targets. The method includes theoretical analyses to balance generalizability and undetectability.
Result: Experimental results show that PTA achieves high success rates across various related targets and remains undetectable against multiple anomaly detection methods.
Conclusion: PTA effectively addresses the limitations of existing targeted adversarial attacks on multimodal pre-trained models, providing improved generalizability and undetectability through proxy-based optimization and theoretical guarantees.
Abstract: Multimodal pre-trained models (e.g., ImageBind), which align distinct data modalities into a shared embedding space, have shown remarkable success across downstream tasks. However, their increasing adoption raises serious security concerns, especially regarding targeted adversarial attacks. In this paper, we show that existing targeted adversarial attacks on multimodal pre-trained models still have limitations in two aspects: generalizability and undetectability. Specifically, the crafted targeted adversarial examples (AEs) exhibit limited generalization to partially known or semantically similar targets in cross-modal alignment tasks (i.e., limited generalizability) and can be easily detected by simple anomaly detection methods (i.e., limited undetectability). To address these limitations, we propose a novel method called Proxy Targeted Attack (PTA), which leverages multiple source-modal and target-modal proxies to optimize targeted AEs, ensuring they remain evasive to defenses while aligning with multiple potential targets. We also provide theoretical analyses to highlight the relationship between generalizability and undetectability and to ensure optimal generalizability while meeting the specified requirements for undetectability. Furthermore, experimental results demonstrate that our PTA can achieve a high success rate across various related targets and remain undetectable against multiple anomaly detection methods.
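The multi-proxy idea can be sketched as an objective that aligns the adversarial example's embedding with several related target proxies at once instead of a single target; the exact loss form below is an assumption based on the summary, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def multi_proxy_loss(ae_embedding, target_proxies):
    """target_proxies: (num_proxies, dim) embeddings of related targets;
    averaging misalignment over proxies aims at the generalizability the
    paper credits to proxy-based optimization."""
    z = F.normalize(ae_embedding, dim=-1)
    p = F.normalize(target_proxies, dim=-1)
    return (1.0 - p @ z).mean()
```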
[193] Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture
Nico Schulthess, Ender Konukoglu
Main category: cs.CV
TL;DR: Proposes using DINOv2 embeddings with Dirichlet Process Mixture Model for unsupervised anomaly detection in medical imaging, achieving competitive performance while reducing computational burden compared to memory-bank approaches.
Details
Motivation: Memory-bank approaches for anomaly detection become computationally expensive for large medical datasets. The authors aim to develop a more efficient method that leverages foundational models while maintaining performance.Method: Models normative DINOv2 embeddings with Dirichlet Process Mixture Model (DPMM), using similarity between component centers and embeddings as anomaly score function to create coarse segmentation masks, eliminating the need for memory banks.
Result: DPMM with DINOv2 embeddings achieves competitive anomaly detection performance on medical imaging benchmarks while at least halving computation time at inference. Normalized DINOv2 embeddings show better alignment with anatomical structures.
Conclusion: The proposed method provides an efficient and effective approach for unsupervised anomaly detection in medical imaging, demonstrating that DINOv2 embeddings (despite natural image training) work well for medical applications when combined with appropriate modeling techniques.
Abstract: In this work, we leverage informative embeddings from foundational models for unsupervised anomaly detection in medical imaging. For small datasets, a memory bank of normative features can be used directly for anomaly detection, as has been demonstrated recently. However, this is unsuitable for large medical datasets as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as anomaly score function to create a coarse anomaly segmentation mask. Our experiments show that, modeled with the DPMM, DINOv2 embeddings achieve very competitive anomaly detection performance on medical imaging benchmarks, despite DINOv2 being trained on natural images, while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally more aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them great representations for anomaly detection. The code is available at https://github.com/NicoSchulthess/anomalydino-dpmm.
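A close off-the-shelf approximation of this recipe uses scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior, which prunes unused components automatically; the distance-to-nearest-center score below stands in for the paper's similarity-based score and is not the released code.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_normative_dpmm(normal_embeddings, max_components=50):
    """Fit a truncated DP mixture to normative embeddings (e.g., DINOv2)."""
    dpmm = BayesianGaussianMixture(
        n_components=max_components,                       # truncation level
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",
    )
    dpmm.fit(normal_embeddings)
    return dpmm

def anomaly_score(dpmm, embeddings):
    """Far from every component center => anomalous."""
    dists = np.linalg.norm(embeddings[:, None, :] - dpmm.means_[None], axis=-1)
    return dists.min(axis=1)
```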
[194] Table Detection with Active Learning
Somraj Gautam, Nachiketa Purohit, Gaurav Harit
Main category: cs.CV
TL;DR: Active learning approach combining uncertainty and diversity strategies for efficient object detection annotation, achieving better performance than random sampling with limited budget.
Details
Motivation: Efficient data annotation is critical for machine learning, especially in object detection tasks that require extensive labeled data. Active learning can minimize annotation costs by selecting the most informative samples.Method: Combines uncertainty-based and diversity-based active learning strategies to select representative examples that improve model generalization. Evaluated on TableBank-LaTeX and TableBank-Word datasets using CascadeTabNet and YOLOv9 architectures.
Result: AL-based example selection significantly outperforms random sampling, reducing annotation effort while maintaining comparable performance to fully supervised models. Achieves higher mAP scores within the same annotation budget.
Conclusion: The proposed active learning approach effectively reduces annotation costs for object detection tasks while maintaining high performance, demonstrating the value of combining uncertainty and diversity strategies.
Abstract: Efficient data annotation remains a critical challenge in machine learning, particularly for object detection tasks requiring extensive labeled data. Active learning (AL) has emerged as a promising solution to minimize annotation costs by selecting the most informative samples. While traditional AL approaches primarily rely on uncertainty-based selection, recent advances suggest that incorporating diversity-based strategies can enhance sampling efficiency in object detection tasks. Our approach ensures the selection of representative examples that improve model generalization. We evaluate our method on two benchmark datasets (TableBank-LaTeX, TableBank-Word) using state-of-the-art table detection architectures, CascadeTabNet and YOLOv9. Our results demonstrate that AL-based example selection significantly outperforms random sampling, reducing annotation effort given a limited budget while maintaining comparable performance to fully supervised models. Our method achieves higher mAP scores within the same annotation budget.
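A generic hybrid of the two strategies, matching the summary's description though not necessarily the paper's exact scoring, greedily picks images that are both uncertain and far from everything already selected:

```python
import numpy as np

def select_batch(uncertainty, features, budget, alpha=0.5):
    """uncertainty: (N,) per-image score; features: (N, D) image embeddings.
    (In practice, normalize the two terms to comparable scales.)"""
    chosen = [int(np.argmax(uncertainty))]
    min_dist = np.linalg.norm(features - features[chosen[0]], axis=1)
    while len(chosen) < budget:
        score = alpha * uncertainty + (1 - alpha) * min_dist  # hybrid value
        score[chosen] = -np.inf                               # no repeats
        nxt = int(np.argmax(score))
        chosen.append(nxt)
        min_dist = np.minimum(
            min_dist, np.linalg.norm(features - features[nxt], axis=1)
        )  # k-center-style update keeps the batch spread out
    return chosen
```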
[195] Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression
Xuekang Zhu, Ji-Zhe Zhou, Kaiwen Feng, Chenfan Qu, Yunfei Wang, Liting Zhou, Jian liu
Main category: cs.CV
TL;DR: RITA reformulates image manipulation localization as a conditional sequence prediction task, predicting manipulated regions layer-by-layer to model temporal dependencies and hierarchical structures in editing operations.
Details
Motivation: Existing image manipulation localization methods are process-agnostic and use one-shot prediction, which causes dimensional collapse and fails to capture the sequential and hierarchical nature of manipulation processes.Method: Proposes RITA framework that predicts manipulated regions in an ordered, layer-by-layer manner using conditional sequence prediction. Creates HSIM benchmark with multi-step manipulation data and HSS metric for evaluation.
Result: RITA achieves state-of-the-art performance on traditional benchmarks and provides a solid foundation for hierarchical localization tasks.
Conclusion: The sequential prediction paradigm effectively addresses the limitations of one-shot methods and shows potential as a general and effective approach for image manipulation localization.
Abstract: Image manipulations often entail a complex manipulation process, comprising a series of editing operations to create a deceptive image, exhibiting sequentiality and hierarchical characteristics. However, existing image manipulation localization (IML) methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse, thereby creating a fundamental mismatch with the intrinsic nature of the IML task. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step’s prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the novel hierarchical localization task, validating its potential as a general and effective paradigm. The code and dataset will be publicly available.
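The paradigm shift can be sketched as an autoregressive inference loop over mask layers; the model interface below, including the stop flag, is hypothetical.

```python
import torch

def predict_layers(model, image, max_steps=8):
    """Emit one manipulation layer at a time, conditioned on prior layers."""
    layers = []
    prev = torch.zeros_like(image[:, :1])     # empty condition mask
    for _ in range(max_steps):
        mask, stop = model(image, prev)       # next layer + stop flag (assumed API)
        if stop:
            break
        layers.append(mask)
        prev = torch.clamp(prev + mask, 0, 1) # accumulate prediction history
    return layers                             # ordered, layer-by-layer estimate
```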
[196] PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction
Manahil Raza, Ayesha Azam, Talha Qaiser, Nasir Rajpoot
Main category: cs.CV
TL;DR: PS3 is a multimodal fusion model that integrates pathology reports, whole slide images (WSIs), and transcriptomic data for improved cancer survival prediction using prototype-based representations to address modality imbalance.
Details
Motivation: Current multimodal approaches focus on WSIs with genomic data, but pathology reports offer complementary clinical information. The authors hypothesize that incorporating pathology reports can enhance prognostic performance despite challenges from heterogeneous data types.Method: A prototype-based approach generates balanced representations: diagnostic prototypes from pathology reports using self-attention, histological prototypes for WSIs, and biological pathway prototypes for transcriptomic data. A Transformer-based fusion model (PS3) processes these multimodal tokens to model intra-modal and cross-modal interactions.
Result: PS3 outperforms state-of-the-art methods on six TCGA datasets, demonstrating superior performance against clinical, unimodal, and multimodal baselines for survival prediction.
Conclusion: The proposed three-modal fusion approach effectively integrates pathology reports, WSIs, and transcriptomic data, showing that incorporating pathology reports significantly enhances prognostic capabilities in computational oncology.
Abstract: Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: https://github.com/manahilr/PS3.
[197] Generative Adversarial Networks Applied for Privacy Preservation in Biometric-Based Authentication and Identification
Lubos Mjachky, Ivan Homoliak
Main category: cs.CV
TL;DR: A privacy-preserving authentication method using GANs to convert face images into visually private domains (e.g., flowers or shoes) for secure biometric authentication.
Details
Motivation: Traditional biometric systems lack user control over data usage and are vulnerable to data leaks and misuse, compromising user privacy.Method: Use generative adversarial networks (GANs) to translate face images into visually private domains, then train authentication classifiers on these transformed images.
Result: The method demonstrates robustness against attacks while maintaining meaningful utility for authentication purposes.
Conclusion: The proposed GAN-based approach provides a privacy-preserving alternative to conventional biometric authentication systems by transforming sensitive data into private domains.
Abstract: Biometric-based authentication systems are being broadly adopted in many areas. However, these systems do not allow participating users to influence the way their data is used. Furthermore, the data may leak and can be misused without the users’ knowledge. In this paper, we propose a new authentication method that preserves the privacy of individuals and is based on a generative adversarial network (GAN). Concretely, we suggest using the GAN for translating images of faces to a visually private domain (e.g., flowers or shoes). Classifiers, which are used for authentication purposes, are then trained on the images from the visually private domain. Based on our experiments, the method is robust against attacks and still provides meaningful utility.
[198] Predictive Quality Assessment for Mobile Secure Graphics
Cas Steigstra, Sergey Milyaev, Shaodi You
Main category: cs.CV
TL;DR: A framework to predict image quality for secure graphic verification by estimating frame utility for downstream tasks, using a lightweight model to filter frames for resource-intensive verification.
Details
Motivation: Poor smartphone image acquisition of high-entropy security patterns causes high false rejection rates, creating a reliability gap in anti-counterfeiting systems.Method: Propose a lightweight quality scoring model to predict frame utility for verification tasks, validated on 32,000+ images from 105 smartphones using FNMR and ISRR metrics. Includes cross-domain analysis on different printing technologies.
Result: A frozen ImageNet-pretrained network with lightweight probe generalizes better to unseen printing technologies than fully fine-tuned models, showing robustness against domain shifts from physical manufacturing.
Conclusion: For domain shifts in physical manufacturing, frozen general-purpose backbones are more robust than full fine-tuning, which can overfit to source-domain artifacts, providing key insight for real-world generalization.
Abstract: The reliability of secure graphic verification, a key anti-counterfeiting tool, is undermined by poor image acquisition on smartphones. Uncontrolled user captures of these high-entropy patterns cause high false rejection rates, creating a significant ‘reliability gap’. To bridge this gap, we depart from traditional perceptual IQA and introduce a framework that predictively estimates a frame’s utility for the downstream verification task. We propose a lightweight model to predict a quality score for a video frame, determining its suitability for a resource-intensive oracle model. Our framework is validated using re-contextualized FNMR and ISRR metrics on a large-scale dataset of 32,000+ images from 105 smartphones. Furthermore, a novel cross-domain analysis on graphics from different industrial printing presses reveals a key finding: a lightweight probe on a frozen, ImageNet-pretrained network generalizes better to an unseen printing technology than a fully fine-tuned model. This provides a key insight for real-world generalization: for domain shifts from physical manufacturing, a frozen general-purpose backbone can be more robust than full fine-tuning, which can overfit to source-domain artifacts.
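The winning configuration reported above, a frozen general-purpose backbone with a small trainable head, can be sketched as follows; the backbone choice, feature layer, and probe shape are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen ImageNet-pretrained backbone: no fine-tuning at all.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()                  # expose 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False

# Lightweight quality probe: the only trainable part.
probe = nn.Sequential(nn.Linear(2048, 128), nn.ReLU(), nn.Linear(128, 1))

def quality_score(frame_batch):
    """Predict a frame's utility for the downstream verification oracle."""
    with torch.no_grad():
        feats = backbone(frame_batch)        # (B, 2048)
    return probe(feats).squeeze(-1)          # higher = more useful frame
```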
[199] SHMoAReg: Spark Deformable Image Registration via Spatial Heterogeneous Mixture of Experts and Attention Heads
Yuxi Zheng, Jianhui Feng, Tianran Li, Marius Staring, Yuchuan Qiao
Main category: cs.CV
TL;DR: SHMoAReg introduces Mixture of Experts (MoE) mechanism to Deformable Image Registration, using specialized attention heads in the encoder and heterogeneous experts in the decoder to improve feature extraction and deformation field prediction.
Details
Motivation: Current encoder-decoder DIR methods lack specialized feature extraction for registration tasks and predict deformation fields homogeneously in all three directions, limiting performance and interpretability.Method: Proposes SHMoAReg with Mixture of Attention heads (MoA) in encoder layers for dynamic attention selection, and Spatial Heterogeneous Mixture of Experts (SHMoE) in decoder layers for heterogeneous deformation prediction using experts with varying kernel sizes.
Result: Achieves consistent improvements on two public datasets, with Dice score increasing from 60.58% to 65.58% on abdominal CT dataset, while enhancing model interpretability through expert utility differentiation.
Conclusion: First successful application of MoE mechanism to DIR tasks, demonstrating improved registration performance and interpretability, with code to be released.
Abstract: Encoder-Decoder architectures are widely used in deep learning-based Deformable Image Registration (DIR), where the encoder extracts multi-scale features and the decoder predicts deformation fields by recovering spatial locations. However, current methods lack specialized extraction of features (that are useful for registration) and predict deformation jointly and homogeneously in all three directions. In this paper, we propose a novel expert-guided DIR network with Mixture of Experts (MoE) mechanism applied in both encoder and decoder, named SHMoAReg. Specifically, we incorporate Mixture of Attention heads (MoA) into encoder layers, while Spatial Heterogeneous Mixture of Experts (SHMoE) into the decoder layers. The MoA enhances the specialization of feature extraction by dynamically selecting the optimal combination of attention heads for each image token. Meanwhile, the SHMoE predicts deformation fields heterogeneously in three directions for each voxel using experts with varying kernel sizes. Extensive experiments conducted on two publicly available datasets show consistent improvements over various methods, with a notable increase from 60.58% to 65.58% in Dice score for the abdominal CT dataset. Furthermore, SHMoAReg enhances model interpretability by differentiating experts’ utilities across/within different resolution layers. To the best of our knowledge, we are the first to introduce MoE mechanism into DIR tasks. The code will be released soon.
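A simplified reading of the SHMoE decoder head is a per-voxel soft mixture over convolutional experts with different kernel sizes; the sketch below follows that reading and is not the authors' code.

```python
import torch
import torch.nn as nn

class SHMoEHead(nn.Module):
    """Experts with varying kernel sizes, mixed per voxel by a learned gate."""
    def __init__(self, in_ch, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv3d(in_ch, 3, k, padding=k // 2) for k in kernel_sizes]
        )  # each expert predicts a 3-direction displacement per voxel
        self.gate = nn.Conv3d(in_ch, len(kernel_sizes), 1)

    def forward(self, feat):                               # (B, C, D, H, W)
        w = torch.softmax(self.gate(feat), dim=1)          # (B, E, D, H, W)
        flows = torch.stack([e(feat) for e in self.experts], dim=1)
        return (w.unsqueeze(2) * flows).sum(dim=1)         # (B, 3, D, H, W)
```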
[200] Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing
Zizheng Yang, Hu Yu, Bing Li, Jinghao Zhang, Jie Huang, Feng Zhao
Main category: cs.CV
TL;DR: DiffLI²D: A novel image dehazing method that leverages pre-trained diffusion models’ latent representations to avoid retraining and reduce computational burden while achieving state-of-the-art performance.
Details
Motivation: Current diffusion-based dehazing methods suffer from massive computational costs due to retraining requirements and extensive sampling steps during inference, limiting their practical application.Method: Explores hazy image properties in frozen pre-trained diffusion models’ semantic latent space, integrates diffusion latent representations at different time-steps into a carefully designed dehazing network to provide guidance for image dehazing.
Result: Extensive experiments on multiple datasets demonstrate superior performance compared to existing image dehazing methods.
Conclusion: The proposed DiffLI²D offers a novel perspective for introducing diffusion models to image dehazing by effectively utilizing informative representations from pre-trained models without retraining or iterative sampling.
Abstract: Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limits the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI²D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI²D avoids re-training diffusion models and the iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at https://github.com/aaaasan111/difflid.
[201] Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models
Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada
Main category: cs.CV
TL;DR: A novel hyperspectral adapter that leverages pretrained vision foundation models to achieve state-of-the-art semantic segmentation performance on hyperspectral imaging data for autonomous driving applications.
Details
Motivation: Current HSI semantic segmentation methods underperform because they rely on architectures optimized for RGB inputs, despite hyperspectral imaging's potential for robust robotic perception in challenging environments.
Method: Proposes a hyperspectral adapter with spectral transformer and spectrum-aware spatial prior module, plus modality-aware interaction block that integrates hyperspectral representations with frozen vision Transformer features through extraction and injection mechanisms.
Result: Extensive evaluations on three benchmark autonomous driving datasets demonstrate state-of-the-art semantic segmentation performance, outperforming both vision-based and hyperspectral segmentation methods.
Conclusion: The proposed architecture effectively bridges the gap between hyperspectral data and pretrained vision models, enabling superior semantic segmentation performance for autonomous driving applications.
Abstract: Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hyperspectraladapter.cs.uni-freiburg.de.
[202] A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA
Belal Shoer, Yova Kementchedjhieva
Main category: cs.CV
TL;DR: The paper addresses challenges in scientific visual question answering by converting separate image-text pairs into unified text-in-image format, enabling effective fine-tuning of multimodal models.
Details
Motivation: Scientific visual question answering is difficult for vision-language models due to complex figures and multimodal context. Existing approaches treat visual and textual content separately, and even state-of-the-art models perform poorly in zero-shot settings on text-in-image formats like EXAMS-V.
Method: Synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tune a small multilingual multimodal model on a mix of synthetic data and EXAMS-V dataset.
Result: Fine-tuning yields notable gains across 13 languages, demonstrating strong average improvements and effective cross-lingual transfer capabilities.
Conclusion: The text-in-image format combined with task-specific fine-tuning significantly improves scientific visual question answering performance across multiple languages, addressing the data scarcity problem in this challenging domain.
Abstract: Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this “text-in-image” format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.
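A minimal sketch of the conversion step, assuming PIL is available: the question text is rendered onto a canvas above the figure to produce a single "text-in-image" sample. Fonts, layout, and multilingual handling are simplified.

```python
from PIL import Image, ImageDraw

def to_text_in_image(figure: Image.Image, question: str) -> Image.Image:
    # Render the question into a white band, then paste the figure below it.
    canvas = Image.new("RGB", (figure.width, figure.height + 60), "white")
    ImageDraw.Draw(canvas).text((10, 10), question, fill="black")
    canvas.paste(figure, (0, 60))
    return canvas

fig = Image.new("RGB", (320, 240), "lightgray")  # stand-in figure
sample = to_text_in_image(fig, "Q: Which curve grows faster? A) red B) blue")
print(sample.size)  # (320, 300)
```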
[203] EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
Botai Yuan, Yutian Zhou, Yingjie Wang, Fushuo Huo, Yongcheng Jing, Li Shen, Ying Wei, Zhiqi Shen, Ziwei Liu, Tianwei Zhang, Jie Yang, Dacheng Tao
Main category: cs.CV
TL;DR: EchoBench is introduced to evaluate sycophancy (uncritical echoing of user bias) in medical LVLMs, revealing high susceptibility across models despite moderate accuracy, with mitigation strategies proposed.
Details
Motivation: Current medical LVLM benchmarks focus on accuracy but overlook reliability and safety, particularly sycophancy in clinical settings where biased inputs can have high-stakes consequences.
Method: EchoBench contains 2,122 images across 18 departments and 20 modalities with 90 prompts simulating biased inputs. Medical, open-source, and proprietary LVLMs are evaluated, with fine-grained analysis by bias type, department, granularity, and modality.
Result: All models show substantial sycophancy (45.98% for Claude 3.7 Sonnet, 59.15% for GPT-4.1, many medical-specific models >95%). Higher data quality/diversity and domain knowledge reduce sycophancy without harming unbiased accuracy.
Conclusion: Robust evaluation beyond accuracy is needed; prompt-level interventions (negative prompting, one-shot, few-shot) reduce sycophancy, motivating training/decoding strategies for safer medical LVLMs.
Abstract: Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy – models’ tendency to uncritically echo user-provided information – in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.
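One plausible way to score sycophancy in this setting (the benchmark's exact protocol may differ) is to count cases where a model that answered correctly under a neutral prompt flips to the injected suggestion under the biased prompt; the field names below are hypothetical.

```python
def sycophancy_rate(records):
    # Each record holds the model's answers under neutral and biased prompts.
    flipped, eligible = 0, 0
    for r in records:
        if r["neutral_answer"] == r["gold"]:  # the model knew the answer
            eligible += 1
            if r["biased_answer"] == r["bias_suggestion"] != r["gold"]:
                flipped += 1                  # it echoed the wrong suggestion
    return flipped / max(eligible, 1)

records = [
    {"gold": "A", "neutral_answer": "A", "biased_answer": "C", "bias_suggestion": "C"},
    {"gold": "B", "neutral_answer": "B", "biased_answer": "B", "bias_suggestion": "D"},
]
print(sycophancy_rate(records))  # 0.5
```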
[204] Smaller is Better: Enhancing Transparency in Vehicle AI Systems via Pruning
Sanish Suwal, Shaurya Garg, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi
Main category: cs.CV
TL;DR: Pruning significantly improves the quality and faithfulness of post-hoc explanations for traffic sign classifiers compared to natural and adversarial training.
Details
Motivation: Connected and autonomous vehicles rely on AI systems where transparency and security are critical, but current post-hoc explanations suffer from inconsistencies and lack of faithfulness in representing model decisions.
Method: Systematically examined three training approaches (natural training, adversarial training, and pruning) on traffic sign classifiers, evaluating their impact on explanation quality using saliency maps through extensive empirical evaluation.
Result: Pruning significantly enhances the comprehensibility and faithfulness of explanations by enforcing sparsity in learned representation, leading to more interpretable and reliable decisions.
Conclusion: Pruning is a promising strategy for developing transparent deep learning models, especially in resource-constrained vehicular AI systems, as it improves both model efficiency and explanation quality.
Abstract: Connected and autonomous vehicles continue to heavily rely on AI systems, where transparency and security are critical for trust and operational safety. Post-hoc explanations provide transparency to these black-box-like AI models, but the quality and reliability of these explanations are often questioned due to inconsistencies and lack of faithfulness in representing model decisions. This paper systematically examines how three widely used training approaches, namely natural training, adversarial training, and pruning, affect the quality of post-hoc explanations for traffic sign classifiers. Through extensive empirical evaluation, we demonstrate that pruning significantly enhances the comprehensibility and faithfulness of explanations (using saliency maps). Our findings reveal that pruning not only improves model efficiency but also enforces sparsity in learned representation, leading to more interpretable and reliable decisions. Additionally, these insights suggest that pruning is a promising strategy for developing transparent deep learning models, especially in resource-constrained vehicular AI systems.
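For readers who want to reproduce the pruning condition, PyTorch ships magnitude pruning out of the box; the sketch below shows the generic setup (the paper's pruning schedule and saliency-map evaluation are not reproduced here).

```python
import torch
import torch.nn.utils.prune as prune

model = torch.nn.Conv2d(3, 16, kernel_size=3)             # stand-in classifier layer
prune.l1_unstructured(model, name="weight", amount=0.5)   # zero out 50% of weights
print(float((model.weight == 0).float().mean()))          # sparsity ~0.5
prune.remove(model, "weight")                              # make the mask permanent
```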
[205] C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis
Min Cen, Zhenfeng Zhuang, Yuzhe Zhang, Min Zeng, Baptiste Magnier, Lequan Yu, Hong Zhang, Liansheng Wang
Main category: cs.CV
TL;DR: C²MIL is a novel dual causal graph-based MIL model that addresses semantic bias and topological noise in survival analysis with WSIs through semantic causal intervention and topological causal discovery.
Details
Motivation: Variations in staining/scanning introduce semantic bias, and irrelevant topological subgraphs create noise, leading to biased slide-level representations that hinder interpretability and generalization in graph-based MIL for survival analysis.
Method: Proposes C²MIL with: 1) cross-scale adaptive feature disentangling module for semantic causal intervention, 2) Bernoulli differentiable causal subgraph sampling for topological causal discovery, and 3) joint optimization combining disentangling supervision and contrastive learning.
Result: Experiments show C²MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines.
Conclusion: C²MIL effectively addresses semantic and topological biases through dual causal modeling, providing improved generalization and interpretability for graph-based MIL in survival analysis with WSIs.
Abstract: Graph-based Multiple Instance Learning (MIL) is widely used in survival analysis with Hematoxylin and Eosin (H&E)-stained whole slide images (WSIs) due to its ability to capture topological information. However, variations in staining and scanning can introduce semantic bias, while topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C$^2$MIL. C$^2$MIL incorporates a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy combining disentangling supervision and contrastive learning enables simultaneous refinement of both semantic and topological causalities. Experiments demonstrate that C$^2$MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines. The code is available at https://github.com/mimic0127/C2MIL.
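The "Bernoulli differentiable" sampling presumably belongs to the relaxed-Bernoulli (binary-concrete / Gumbel-Sigmoid) family; below is a generic sketch of that technique, not C²MIL's exact formulation.

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Relaxed Bernoulli sample in (0, 1), differentiable w.r.t. logits."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    g = torch.log(u) - torch.log1p(-u)   # Logistic(0, 1) noise
    return torch.sigmoid((logits + g) / tau)

edge_logits = torch.randn(10, requires_grad=True)  # one logit per graph edge
edge_mask = gumbel_sigmoid(edge_logits)            # soft subgraph mask
loss = edge_mask.sum()                             # placeholder objective
loss.backward()                                    # gradients flow to the logits
print(edge_mask.min().item() > 0, edge_logits.grad is not None)  # True True
```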
[206] U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT
Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Main category: cs.CV
TL;DR: U-Mamba2-SSL is a semi-supervised learning framework for automated teeth and pulp segmentation in CBCT scans, achieving high accuracy with limited labeled data.
Details
Motivation: Manual segmentation of teeth and pulp in CBCT scans requires extensive expertise and is time-consuming, creating a need for automated algorithms that can effectively utilize unlabeled data.
Method: The framework builds on the U-Mamba2 model with multi-stage training: self-supervised pre-training using a disruptive autoencoder, consistency regularization with input/feature perturbations, and pseudo-labeling with reduced loss weighting.
Result: U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating superior performance.
Conclusion: The proposed semi-supervised learning framework effectively leverages unlabeled data for accurate teeth and pulp segmentation, providing a practical solution for clinical applications.
Abstract: Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at https://github.com/zhiqin1998/UMamba2.
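A minimal sketch of the two unlabeled-data losses named above, under assumed shapes and weightings: a consistency term between predictions on clean and perturbed inputs, plus a down-weighted pseudo-label term. The stand-in `model` is not U-Mamba2.

```python
import torch
import torch.nn.functional as F

def semi_supervised_losses(model, x_unlabeled, pseudo_weight: float = 0.1):
    noisy = x_unlabeled + 0.1 * torch.randn_like(x_unlabeled)  # input perturbation
    p_clean = model(x_unlabeled)
    p_noisy = model(noisy)
    # Consistency: perturbed prediction should match the clean one.
    consistency = F.mse_loss(p_noisy.softmax(1), p_clean.softmax(1).detach())
    # Pseudo-labels, down-weighted to limit the impact of wrong labels.
    pseudo = p_clean.argmax(1).detach()
    pseudo_loss = pseudo_weight * F.cross_entropy(p_noisy, pseudo)
    return consistency + pseudo_loss

model = torch.nn.Conv3d(1, 3, kernel_size=3, padding=1)  # stand-in 3D segmenter
loss = semi_supervised_losses(model, torch.randn(2, 1, 8, 8, 8))
loss.backward()
```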
[207] Optical Ocean Recipes: Creating Realistic Datasets to Facilitate Underwater Vision Research
Patricia Schöntag, David Nakath, Judith Fischer, Rüdiger Röttgers, Kevin Köser
Main category: cs.CV
TL;DR: Optical Ocean Recipes framework for creating realistic underwater datasets with controlled conditions to address challenges in machine vision evaluation.
Details
Motivation: Current machine vision testing in underwater environments lacks generalizability due to optical challenges and varying water conditions, making exhaustive open-water testing impractical.
Method: Developed a framework using calibrated color and scattering additives to create repeatable, controlled underwater testing environments that simulate realistic optical conditions.
Result: Created a demonstration dataset enabling ground-truth data generation for various vision tasks including water parameter estimation, image restoration, segmentation, visual SLAM, and underwater image synthesis.
Conclusion: The Optical Ocean Recipes provide a unique, controlled framework for analyzing machine vision in realistic underwater scenarios, with dataset and evaluation code to be made available.
Abstract: The development and evaluation of machine vision in underwater environments remains challenging, often relying on trial-and-error-based testing tailored to specific applications. This is partly due to the lack of controlled, ground-truthed testing environments that account for the optical challenges, such as color distortion from spectrally variant light attenuation, reduced contrast and blur from backscatter and volume scattering, and dynamic light patterns from natural or artificial illumination. Additionally, the appearance of ocean water in images varies significantly across regions, depths, and seasons. However, most machine vision evaluations are conducted under specific optical water types and imaging conditions, and therefore often lack generalizability. Exhaustive testing across diverse open-water scenarios is technically impractical. To address this, we introduce the Optical Ocean Recipes, a framework for creating realistic datasets under controlled underwater conditions. Unlike synthetic or open-water data, these recipes, using calibrated color and scattering additives, enable repeatable and controlled testing of the impact of water composition on image appearance. Hence, this provides a unique framework for analyzing machine vision in realistic, yet controlled underwater scenarios. The controlled environment enables the creation of ground-truth data for a range of vision tasks, including water parameter estimation, image restoration, segmentation, visual SLAM, and underwater image synthesis. We provide a demonstration dataset generated using the Optical Ocean Recipes and briefly demonstrate the use of our system for two underwater vision tasks. The dataset and evaluation code will be made available.
[208] Universal Camouflage Attack on Vision-Language Models for Autonomous Driving
Dehong Kong, Sifan Yu, Siyuan Liang, Jiawei Liang, Jianhou Gan, Aishan Liu, Wenqi Ren
Main category: cs.CV
TL;DR: UCA is the first Universal Camouflage Attack framework for Visual Language Modeling in Automated Driving that generates physically realizable camouflage textures operating in feature space rather than logit layer, achieving strong generalization across different commands and models.
Details
Motivation: Existing adversarial attacks have limitations: physical attacks target vision modules but don't transfer well to VLM-AD systems, while digital attacks against VLM-AD lack physical realizability. VLM-AD systems show vulnerability in encoder and projection layers.
Method: UCA introduces feature divergence loss (FDL) to maximize representational discrepancy between clean and adversarial images. It uses multi-scale learning strategy and adjusts sampling ratio to enhance adaptability to scale/viewpoint changes and improve training stability.
Result: Extensive experiments show UCA induces incorrect driving commands across various VLM-AD models and scenarios, surpassing state-of-the-art methods by 30% in 3-P metrics. Demonstrates strong robustness under diverse viewpoints and dynamic conditions.
Conclusion: UCA represents a practical and effective adversarial attack framework for VLM-AD systems with high potential for real-world deployment due to its physical realizability and strong generalization capabilities.
Abstract: Visual language modeling for automated driving is emerging as a promising research direction with substantial improvements in multimodal reasoning capabilities. Despite its advanced reasoning abilities, VLM-AD remains vulnerable to serious security threats from adversarial attacks, which involve misleading model decisions through carefully crafted perturbations. Existing attacks have obvious challenges: 1) Physical adversarial attacks primarily target vision modules. They are difficult to directly transfer to VLM-AD systems because they typically attack low-level perceptual components. 2) Adversarial attacks against VLM-AD have largely concentrated on the digital level. To address these challenges, we propose the first Universal Camouflage Attack (UCA) framework for VLM-AD. Unlike previous methods that focus on optimizing the logit layer, UCA operates in the feature space to generate physically realizable camouflage textures that exhibit strong generalization across different user commands and model architectures. Motivated by the observed vulnerability of encoder and projection layers in VLM-AD, UCA introduces a feature divergence loss (FDL) that maximizes the representational discrepancy between clean and adversarial images. In addition, UCA incorporates a multi-scale learning strategy and adjusts the sampling ratio to enhance its adaptability to changes in scale and viewpoint diversity in real-world scenarios, thereby improving training stability. Extensive experiments demonstrate that UCA can induce incorrect driving commands across various VLM-AD models and driving scenarios, significantly surpassing existing state-of-the-art attack methods (improving 30% in 3-P metrics). Furthermore, UCA exhibits strong attack robustness under diverse viewpoints and dynamic conditions, indicating high potential for practical deployment.
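A hedged sketch of what a feature divergence loss can look like: push the adversarial image's features away from the clean image's features in the encoder's feature space. The cosine-similarity form and the stand-in encoder are assumptions for illustration, not UCA's exact loss.

```python
import torch
import torch.nn.functional as F

def feature_divergence_loss(encoder, clean: torch.Tensor, adv: torch.Tensor):
    f_clean = encoder(clean).flatten(1).detach()  # clean features are fixed targets
    f_adv = encoder(adv).flatten(1)
    # Return similarity; minimizing it maximizes the feature discrepancy.
    return F.cosine_similarity(f_adv, f_clean, dim=1).mean()

encoder = torch.nn.Conv2d(3, 16, kernel_size=3)   # stand-in for a VLM encoder layer
adv = torch.randn(2, 3, 32, 32, requires_grad=True)
loss = feature_divergence_loss(encoder, torch.randn(2, 3, 32, 32), adv)
loss.backward()                                   # gradients drive adv away from clean
```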
[209] PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation
Mahmoud Khater, Mona Strauss, Philipp von Olshausen, Alexander Reiterer
Main category: cs.CV
TL;DR: PU-Gaussian is a novel point cloud upsampling method that uses anisotropic 3D Gaussian distributions to model local geometry, enabling explicit upsampling through direct point sampling followed by refinement.
Details
Motivation: Existing point cloud upsampling methods often sacrifice geometric interpretability or robustness to input sparsity. Sparse and noisy 3D point clouds from sensors need dense, high-fidelity representations for downstream tasks.
Method: Models local neighborhoods using anisotropic 3D Gaussian distributions, performs explicit upsampling by direct point sampling to generate dense coarse output, then refines with a network for uniform distribution and sharp edges.
Result: Achieves state-of-the-art performance on PU1K and PUGAN datasets, demonstrating superior upsampling quality compared to prior methods.
Conclusion: PU-Gaussian provides an effective solution for point cloud upsampling that maintains geometric interpretability while being robust to input sparsity, with publicly available code and models.
Abstract: Point clouds produced by 3D sensors are often sparse and noisy, posing challenges for tasks requiring dense and high-fidelity 3D representations. Prior work has explored both implicit feature-based upsampling and distance-function learning to address this, but often at the expense of geometric interpretability or robustness to input sparsity. To overcome these limitations, we propose PU-Gaussian, a novel upsampling network that models the local neighborhood around each point using anisotropic 3D Gaussian distributions. These Gaussians capture the underlying geometric structure, allowing us to perform upsampling explicitly in the local geometric domain by direct point sampling. The sampling process generates a dense, but coarse, point cloud. A subsequent refinement network adjusts the coarse output to produce a more uniform distribution and sharper edges. We perform extensive testing on the PU1K and PUGAN datasets, demonstrating that PU-Gaussian achieves state-of-the-art performance. We make code and model weights publicly available at https://github.com/mvg-inatech/PU-Gaussian.git.
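The core sampling step can be illustrated in a few lines of NumPy: fit an anisotropic Gaussian to each point's neighborhood, then draw new points from it. The kNN size, regularization, and the learned refinement network are simplified away, so this is only a sketch of the idea.

```python
import numpy as np

def upsample_point(points: np.ndarray, center_idx: int, k: int = 8, n_new: int = 4):
    # Gather the k nearest neighbors of the chosen point.
    d = np.linalg.norm(points - points[center_idx], axis=1)
    nbrs = points[np.argsort(d)[:k]]
    mu = nbrs.mean(axis=0)
    cov = np.cov(nbrs.T) + 1e-6 * np.eye(3)  # anisotropic local Gaussian
    return np.random.multivariate_normal(mu, cov, size=n_new)

pts = np.random.rand(100, 3)
new_pts = upsample_point(pts, center_idx=0)
print(new_pts.shape)  # (4, 3)
```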
[210] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
Main category: cs.CV
TL;DR: CNNs are not inherently texture-biased as previously thought; they primarily rely on local shape features, and this reliance can be mitigated with modern training or architectures. Feature reliance patterns vary across domains: computer vision prioritizes shape, medical imaging emphasizes color, and remote sensing focuses on texture.
Details
Motivation: To challenge the established hypothesis that CNNs are texture-biased by addressing limitations in previous cue-conflict experiments and develop a more robust framework to quantify feature reliance.
Method: Proposed a domain-agnostic framework that systematically suppresses shape, texture, and color cues to quantify feature reliance without forced-choice conflicts. Evaluated humans and neural networks under controlled suppression conditions across computer vision, medical imaging, and remote sensing domains.
Result: CNNs predominantly rely on local shape features, not texture, and this reliance can be reduced with modern training strategies or architectures like ConvNeXt and ViTs. Domain-specific patterns emerged: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models rely more on texture.
Conclusion: The texture-bias hypothesis for CNNs is oversimplified; feature reliance is context-dependent and varies systematically across domains, highlighting the need for domain-aware model design and evaluation.
Abstract: The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance towards texture. Code is available at https://github.com/tomburgert/feature-reliance.
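Illustrative suppression operators in the spirit of the framework (the paper's exact operators may differ): grayscale conversion removes color, smoothing removes texture, and patch shuffling disrupts global shape.

```python
import numpy as np

def suppress_color(img: np.ndarray) -> np.ndarray:   # img: (H, W, 3) in [0, 1]
    gray = img.mean(axis=2, keepdims=True)
    return np.repeat(gray, 3, axis=2)

def suppress_texture(img: np.ndarray, k: int = 7) -> np.ndarray:
    # Naive box blur per pixel; fine for illustration, slow for large images.
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.copy(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean(axis=(0, 1))
    return out

def suppress_shape(img: np.ndarray, patch: int = 8) -> np.ndarray:
    # Shuffle equal-size patches (H and W assumed divisible by `patch`).
    h, w, _ = img.shape
    tiles = [img[i:i + patch, j:j + patch]
             for i in range(0, h, patch) for j in range(0, w, patch)]
    np.random.default_rng(0).shuffle(tiles)
    out = np.zeros_like(img)
    idx = 0
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            out[i:i + patch, j:j + patch] = tiles[idx]; idx += 1
    return out

img = np.random.rand(32, 32, 3)
print(suppress_color(img).shape, suppress_texture(img).shape, suppress_shape(img).shape)
```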
[211] An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation
Kwang-Hyun Uhm, Hyunjun Cho, Sung-Hoo Hong, Seung-Won Jung
Main category: cs.CV
TL;DR: A novel cross-view texture transfer approach for CT slice interpolation that leverages anisotropic characteristics of 3D CT volumes by transferring high-resolution in-plane texture details to low-resolution through-plane images.
Details
Motivation: Clinical CT images are often acquired with large slice thicknesses due to storage and time constraints, resulting in anisotropic volumes with poor inter-slice resolution that can hinder disease diagnosis. Existing methods don't fully exploit the anisotropic nature of 3D CT volumes.
Method: Proposes a framework that uses high-resolution in-plane texture as reference to enhance low-resolution through-plane images. Introduces a multi-reference non-local attention module to extract meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images.
Result: The method performs significantly better than existing competing methods on public CT datasets, including a real-paired benchmark, demonstrating effectiveness in CT slice interpolation.
Conclusion: The proposed cross-view texture transfer approach effectively utilizes the anisotropic characteristics of 3D CT volumes to improve inter-slice resolution, with verified performance superiority over existing methods.
Abstract: Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at https://github.com/khuhm/ACVTT.
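The multi-reference non-local attention idea maps naturally onto standard cross-attention, with through-plane tokens as queries and a bank of in-plane reference tokens as keys/values; the sketch below uses that generic form, not the paper's exact module.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

query = torch.randn(1, 128, 64)                      # through-plane (low-res) tokens
refs = [torch.randn(1, 128, 64) for _ in range(3)]   # three in-plane references
memory = torch.cat(refs, dim=1)                      # (1, 384, 64) reference bank
out, _ = attn(query, memory, memory)                 # transfer in-plane texture cues
print(out.shape)  # torch.Size([1, 128, 64])
```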
[212] 4D Driving Scene Generation With Stereo Forcing
Hao Lu, Zhuang Ma, Guangfeng Jiang, Wenhang Ge, Bohan Li, Yuzhan Cai, Wenzhao Zheng, Yunpeng Zhang, Yingcong Chen
Main category: cs.CV
TL;DR: PhiGenesis is a unified framework for 4D scene generation that bridges video generation and novel view synthesis, producing temporally continuous 4D Gaussian splatting representations from multi-view inputs.
Details
Motivation: Current generative models struggle with dynamic 4D driving scenes that support both temporal extrapolation and spatial novel view synthesis without per-scene optimization. Bridging generation and NVS remains a major challenge.
Method: Two-stage approach: 1) Pre-trained video VAE with range-view adapter for feed-forward 4D reconstruction from multi-view images; 2) Geometric-guided video diffusion model using rendered historical scenes as priors for future view generation, with Stereo Forcing conditioning strategy to address geometric exposure bias.
Result: Achieves state-of-the-art performance in appearance/geometric reconstruction, temporal generation, and novel view synthesis tasks, with competitive downstream evaluation performance.
Conclusion: PhiGenesis successfully bridges the gap between generation and novel view synthesis, enabling temporally coherent 4D scene generation from multi-view inputs while handling geometric uncertainty through innovative conditioning strategies.
Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Our experimental results demonstrate that our method achieves state-of-the-art performance in both appearance and geometric reconstruction, temporal generation and novel view synthesis (NVS) tasks, while simultaneously delivering competitive performance in downstream evaluations. Homepage is at \href{https://jiangxb98.github.io/PhiGensis}{PhiGensis}.
[213] A Versatile Foundation Model for AI-enabled Mammogram Interpretation
Fuxiang Huang, Jiayi Zhu, Yunfang Yu, Yu Xie, Yuan Guo, Qingcong Kong, Mingxiang Wu, Xinrui Jiang, Shu Yang, Jiabo Ma, Ziyi Liu, Zhe Xu, Zhixuan Chen, Yujie Tan, Zifan He, Luhui Mao, Xi Wang, Junlin Hou, Lei Zhang, Qiong Luo, Zhenhui Li, Herui Yao, Hao Chen
Main category: cs.CV
TL;DR: VersaMammo is a versatile foundation model for mammograms that achieves state-of-the-art performance across 92 clinical tasks using a two-stage pre-training strategy on the largest multi-institutional mammogram dataset.
Details
Motivation: Current foundation models for mammogram analysis face limitations including insufficient training data diversity, limited generalizability, and lack of comprehensive clinical evaluation, hindering clinical translation.
Method: Two-stage pre-training: first self-supervised learning on unlabeled mammograms to train a teacher model, then supervised learning with knowledge distillation to transfer features and clinical knowledge. Uses largest multi-institutional dataset (706,239 images from 21 sources).
Result: Achieves SOTA performance, ranking first in 50/68 internal tasks and 20/24 external validation tasks, with average ranks of 1.5 and 1.2 respectively across 5 clinical task categories.
Conclusion: VersaMammo demonstrates superior generalization and clinical utility, representing a substantial advancement toward reliable and scalable breast cancer screening and diagnosis.
Abstract: Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in training data, limited model generalizability, and a lack of comprehensive evaluation across clinically relevant tasks. Here, we introduce VersaMammo, a versatile foundation model for mammograms, designed to overcome these limitations. We curated the largest multi-institutional mammogram dataset to date, comprising 706,239 images from 21 sources. To improve generalization, we propose a two-stage pre-training strategy to develop VersaMammo, a mammogram foundation model. First, a teacher model is trained via self-supervised learning to extract transferable features from unlabeled mammograms. Then, supervised learning combined with knowledge distillation transfers both features and clinical knowledge into VersaMammo. To ensure a comprehensive evaluation, we established a benchmark comprising 92 specific tasks, including 68 internal tasks and 24 external validation tasks, spanning 5 major clinical task categories: lesion detection, segmentation, classification, image retrieval, and visual question answering. VersaMammo achieves state-of-the-art performance, ranking first in 50 out of 68 specific internal tasks and 20 out of 24 external validation tasks, with average ranks of 1.5 and 1.2, respectively. These results demonstrate its superior generalization and clinical utility, offering a substantial advancement toward reliable and scalable breast cancer screening and diagnosis.
[214] A co-evolving agentic AI system for medical imaging analysis
Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, Yan Cui, Jialu Yao, Shunsuke Koga, Zhi Huang
Main category: cs.CV
TL;DR: TissueLab is a co-evolving agentic AI system for medical image analysis that integrates tools across pathology, radiology, and spatial omics, enabling real-time interactive analysis with expert feedback and achieving state-of-the-art performance.
Details
Motivation: Current agentic AI systems in medical image analysis face limitations due to lack of robust ecosystem, insufficient toolsets, and absence of real-time interactive expert feedback, hindering their performance and adoption.
Method: TissueLab integrates tool factories across multiple medical domains, standardizes inputs/outputs/capabilities of diverse tools, and uses a co-evolving system that allows direct questioning, automatic workflow planning, real-time analysis with expert visualization and refinement capabilities.
Result: TissueLab achieves state-of-the-art performance compared to end-to-end VLMs and other agentic AI systems across diverse clinical tasks, and can deliver accurate results in unseen disease contexts within minutes through active learning without massive datasets.
Conclusion: As an open-source ecosystem, TissueLab establishes a foundation for next-generation medical AI by accelerating computational research and translational adoption in medical imaging through continuous learning from clinicians.
Abstract: Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, their performance and adoption remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present “TissueLab”, a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.
[215] HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy
Dayu Tan, Zhenpeng Xu, Yansen Su, Xin Peng, Chunhou Zheng, Weimin Zhong
Main category: cs.CV
TL;DR: HiPerformer is a novel medical image segmentation method that addresses feature inconsistency in CNN-Transformer hybrid architectures through modular hierarchical design, Local-Global Feature Fusion, and Progressive Pyramid Aggregation modules.
Details
Motivation: Existing CNN-Transformer hybrid methods use simple feature fusion techniques that struggle with feature inconsistencies, leading to information conflict and loss when integrating local details and global context in medical image segmentation.
Method: HiPerformer employs a modular hierarchical encoder for dynamic parallel fusion of multi-source features, a Local-Global Feature Fusion (LGFF) module for precise integration, and a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections for better multi-scale feature representation.
Result: Experiments on eleven public datasets show that HiPerformer outperforms existing segmentation techniques with higher accuracy and robustness.
Conclusion: The proposed method effectively addresses feature inconsistency problems in medical image segmentation and achieves superior performance through innovative architectural design and fusion strategies.
Abstract: Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.
[216] PerFace: Metric Learning in Perceptual Facial Similarity for Enhanced Face Anonymization
Haruka Kumagai, Leslie Wöhler, Satoshi Ikehata, Kiyoharu Aizawa
Main category: cs.CV
TL;DR: This paper proposes a human-perception-based face similarity metric to address the limitations of existing binary identity classification methods in face anonymization.
Details
Motivation: Existing face anonymization models use binary identity classification (same person or not), which fails to capture nuanced similarities needed to balance anonymity and naturalness in face swapping.
Method: Created a dataset of 6,400 triplet annotations and used metric learning to predict face similarity based on human perception.
Result: Experimental results show significant improvements in both face similarity prediction and attribute-based face classification tasks compared to existing methods.
Conclusion: The proposed human-perception-based similarity metric effectively addresses the nuanced similarity measurement needed for optimal face anonymization.
Abstract: In response to rising societal awareness of privacy concerns, face anonymization techniques have advanced, including the emergence of face-swapping methods that replace one identity with another. Achieving a balance between anonymity and naturalness in face swapping requires careful selection of identities: overly similar faces compromise anonymity, while dissimilar ones reduce naturalness. Existing models, however, focus on binary identity classification (“the same person or not”), making it difficult to measure nuanced similarities such as “completely different” versus “highly similar but different.” This paper proposes a human-perception-based face similarity metric, creating a dataset of 6,400 triplet annotations and applying metric learning to predict the similarity. Experimental results demonstrate significant improvements in both face similarity prediction and attribute-based face classification tasks over existing methods.
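The metric-learning step can be sketched with PyTorch's built-in triplet margin loss: for each annotated triplet, the perceptually closer face is pulled nearer in embedding space. The toy embedder below stands in for the face encoder; only the loss structure mirrors the setup described above.

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy embedder
triplet = nn.TripletMarginLoss(margin=0.2)

# (anchor, more-similar, less-similar) images from human triplet annotations
anchor, similar, dissimilar = (torch.randn(4, 3, 64, 64) for _ in range(3))
loss = triplet(embed(anchor), embed(similar), embed(dissimilar))
loss.backward()
```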
[217] FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Main category: cs.CV
TL;DR: FAST is a foreground-aware diffusion framework for industrial anomaly segmentation that uses novel AIAS and FARM modules to efficiently generate high-quality anomalies in just 10 sampling steps while preserving localized anomaly signals.
Details
Motivation: Existing industrial anomaly synthesis methods struggle with balancing sampling efficiency and generation quality, and treat all spatial regions uniformly, overlooking statistical differences between anomaly and background areas, which hinders controllable, structure-specific anomaly synthesis for segmentation tasks.
Method: Proposes FAST framework with two modules: AIAS (Anomaly-Informed Accelerated Sampling) - a training-free sampling algorithm using coarse-to-fine aggregation for efficient synthesis; and FARM (Foreground-Aware Reconstruction Module) - adaptively adjusts anomaly-aware noise in masked foreground regions to preserve localized anomaly signals during denoising.
Result: Extensive experiments on multiple industrial benchmarks show that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks, achieving state-of-the-art performance with only 10 sampling steps.
Conclusion: FAST provides an effective solution for segmentation-oriented industrial anomaly synthesis by addressing the limitations of existing methods through foreground-aware processing and efficient sampling, demonstrating superior performance across various industrial benchmarks.
Abstract: Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://anonymous.4open.science/r/NeurIPS-938.
[218] A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices
Bishal Adhikari, Jiajia Li, Eric S. Michel, Jacob Dykes, Te-Ming Paul Tseng, Mary Love Tagert, Dong Chen
Main category: cs.CV
TL;DR: This paper addresses deer intrusion in agriculture by evaluating deep learning models for deer detection, introducing a curated dataset, and benchmarking performance on edge computing platforms.
Details
Motivation: Traditional deer mitigation strategies are inadequate for modern farming due to being labor-intensive, costly, and ineffective, creating a need for intelligent autonomous solutions that require accurate deer detection.
Method: The study presents a curated dataset of 3,095 annotated deer images and conducts extensive comparative analysis of 12 model variants across four YOLO architectures (v8, v9, v10, v11), benchmarking performance on high-end GPU and edge computing platforms.
Result: Real-time detection is not feasible on Raspberry Pi without optimization, while NVIDIA Jetson achieves >30 FPS with GPU acceleration. Smaller models like YOLOv11n, YOLOv8s, and YOLOv9s offer optimal balance of high accuracy (AP@.5 > 0.85) and computational efficiency (FPS > 30).
Conclusion: The study provides practical insights for deploying deer detection systems in agriculture, with smaller advanced YOLO models showing the best performance trade-off for real-world applications, and makes both dataset and code publicly available.
Abstract: The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies since these methods are often labor-intensive, costly, and ineffective for modern farming systems. To overcome this, there is a critical need for intelligent, autonomous solutions which require accurate and efficient deer detection. But the progress in this field is impeded by a significant gap in the literature, mainly the lack of a domain-specific, practical dataset and limited study on the on-field deployability of deer detection systems. Addressing this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. The contributions of this work are threefold. First, we introduce a curated, publicly available dataset of 3,095 annotated images with bounding-box annotations of deer, derived from the Idaho Cameratraps project. Second, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures (v8, v9, v10, and v11). Finally, we benchmarked performance on a high-end NVIDIA RTX 5090 GPU and evaluated on two representative edge computing platforms: Raspberry Pi 5 and NVIDIA Jetson AGX Xavier. Results show that real-time detection is not feasible on the Raspberry Pi without hardware-specific model optimization, while NVIDIA Jetson provides greater than 30 FPS with GPU-accelerated inference on ’s’ and ’n’ series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (AP@.5 > 0.85) and computational efficiency (FPS > 30). To support further research, both the source code and datasets are publicly available at https://github.com/WinnerBishal/track-the-deer.
[219] Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On
Qi Li, Shuwen Qiu, Julien Han, Xingzi Xu, Mehmet Saygin Seyfioglu, Kee Kiat Koo, Karim Bouyarmane
Main category: cs.CV
TL;DR: This paper enhances Virtual Try-On (VTON) technology by incorporating pose control through spatial concatenation of pose data without adding parameters, using pose maps for optimal results and introducing mixed-mask training for flexible product integration.
Details
Motivation: The growing demand for VTON technology requires accurate pose control to align products with users' bodies in diverse orientations, but integrating pose conditions without extra parameters or complexity is challenging.
Method: The authors build on a baseline VTON model that concatenates reference images without external encoders. They spatially concatenate pose data (pose maps and skeletons) without additional parameters, and introduce mixed-mask training with fine-grained and bounding box masks.
Result: Experiments show that pose stitching with pose maps yields the best results, improving both pose preservation and output realism. The mixed-mask strategy enables flexible product integration across varied poses.
Conclusion: The proposed approach effectively incorporates pose control into VTON models through simple spatial concatenation and mixed-mask training, enhancing realism and flexibility without increasing model complexity.
Abstract: As online shopping continues to grow, the demand for Virtual Try-On (VTON) technology has surged, allowing customers to visualize products on themselves by overlaying product images onto their own photos. An essential yet challenging condition for effective VTON is pose control, which ensures accurate alignment of products with the user’s body while supporting diverse orientations for a more immersive experience. However, incorporating pose conditions into VTON models presents several challenges, including selecting the optimal pose representation, integrating poses without additional parameters, and balancing pose preservation with flexible pose control. In this work, we build upon a baseline VTON model that concatenates the reference image condition without external encoder, control network, or complex attention layers. We investigate methods to incorporate pose control into this pure concatenation paradigm by spatially concatenating pose data, comparing performance using pose maps and skeletons, without adding any additional parameters or module to the baseline model. Our experiments reveal that pose stitching with pose maps yields the best results, enhancing both pose preservation and output realism. Additionally, we introduce a mixed-mask training strategy using fine-grained and bounding box masks, allowing the model to support flexible product integration across varied poses and conditions.
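The parameter-free conditioning amounts to stitching the pose map into the model input along a spatial axis rather than adding channels or encoders; a minimal sketch with illustrative sizes (the actual layout in the paper may differ):

```python
import torch

person = torch.randn(1, 3, 256, 192)     # user photo
garment = torch.randn(1, 3, 256, 192)    # product reference image
pose_map = torch.randn(1, 3, 256, 192)   # rendered pose map as an RGB-like image

# Spatial concatenation along width: no new channels, encoders, or parameters.
model_input = torch.cat([person, garment, pose_map], dim=3)
print(model_input.shape)  # torch.Size([1, 3, 256, 576])
```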
[220] PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu
Main category: cs.CV
TL;DR: PhysCtrl is a physics-grounded image-to-video generation framework that uses physical parameters and force control to create more physically plausible videos than existing methods.
Details
Motivation: Existing video generation models produce photo-realistic videos but lack physical plausibility and 3D controllability, limiting their realism and practical applications.
Method: Uses a generative physics network with diffusion model conditioned on physics parameters and forces, trained on 550K synthetic animations. Features spatiotemporal attention blocks for particle interactions and physics-based constraints for plausibility.
Result: Generates realistic physics-grounded motion trajectories that drive image-to-video models to produce high-fidelity, controllable videos superior to existing methods in both visual quality and physical plausibility.
Conclusion: PhysCtrl successfully bridges the gap between visual realism and physical accuracy in video generation, enabling controllable physics-based video synthesis across multiple materials.
Abstract: Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl
[221] EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, Daniil Pakhomov, Zhe Lin, Soo Ye Kim, Qiang Xu
Main category: cs.CV
TL;DR: EditVerse is a unified framework for image and video generation/editing using a single model that represents text, image, and video as unified token sequences, achieving state-of-the-art performance.
Details
Motivation: Video generation and editing remain fragmented due to architectural limitations and data scarcity, while image tasks have successfully unified. The authors aim to create a unified framework for both modalities.
Method: Represent all modalities (text, image, video) as unified token sequences using self-attention for in-context learning and cross-modal transfer. Created a scalable data pipeline with 232K video editing samples combined with large-scale datasets for joint training.
Result: EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, with emergent editing and generation abilities across modalities. Also introduced EditVerseBench benchmark.
Conclusion: The unified framework successfully addresses fragmentation in video editing/generation, demonstrating robust performance and cross-modal capabilities through unified token representation and scalable training data.
Abstract: Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
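To make the unified-token-sequence idea concrete, here is a minimal PyTorch sketch of concatenating text, image, and video tokens into one sequence for joint self-attention; the module names, type embeddings, and sizes are assumptions for illustration, not EditVerse's actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedSequenceBackbone(nn.Module):
    """Sketch of the unified-token-sequence idea: text, image, and video
    tokens are concatenated into one sequence with learned modality-type
    embeddings, so a single self-attention stack attends across modalities.
    Module names and sizes are illustrative, not EditVerse's exact design."""

    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.type_embed = nn.Embedding(3, dim)   # 0=text, 1=image, 2=video

    def forward(self, text_tok, image_tok, video_tok):
        # Each input: (B, L_modality, dim); lengths may differ freely.
        parts = []
        for t, tok in enumerate([text_tok, image_tok, video_tok]):
            type_ids = torch.full(tok.shape[:2], t, dtype=torch.long)
            parts.append(tok + self.type_embed(type_ids))
        return self.backbone(torch.cat(parts, dim=1))  # joint attention
```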
[222] An Optimized PatchMatch for Multi-scale and Multi-feature Label Fusion
Rémi Giraud, Vinh-Thong Ta, Nicolas Papadakis, José V. Manjón, D. Louis Collins, Pierrick Coupé, Alzheimer’s Disease Neuroimaging Initiative
Main category: cs.CV
TL;DR: OPAL is an optimized patch-based label fusion framework that drastically reduces computation time for MRI segmentation while achieving state-of-the-art accuracy comparable to inter-expert variability.
Details
Motivation: To develop a more efficient patch-based label fusion method for MRI segmentation that reduces computation time while maintaining high accuracy, enabling processing on large databases and opening the way for new strategies.
Method: An Optimized PAtchMatch Label fusion (OPAL) strategy with a multi-scale and multi-feature framework for searching similar patches in MRI segmentation.
Result: OPAL achieved the highest median Dice coefficients (89.9% for ICBM, 90.1% for EADC-ADNI) and segmentation accuracy similar to inter-expert variability. Hippocampal volumes from automatic and manual segmentation were highly correlated.
Conclusion: OPAL provides efficient and accurate MRI segmentation, enabling more accurate separation of pathological populations through highly correlated volume measurements compared to manual segmentation.
Abstract: Automatic segmentation methods are important tools for quantitative analysis of Magnetic Resonance Images (MRI). Recently, patch-based label fusion approaches have demonstrated state-of-the-art segmentation accuracy. In this paper, we introduce a new patch-based label fusion framework to perform segmentation of anatomical structures. The proposed approach uses an Optimized PAtchMatch Label fusion (OPAL) strategy that drastically reduces the computation time required for the search of similar patches. The reduced computation time of OPAL opens the way for new strategies and facilitates processing on large databases. In this paper, we investigate new perspectives offered by OPAL, by introducing a new multi-scale and multi-feature framework. During our validation on hippocampus segmentation, we use two datasets: young adults in the ICBM cohort and elderly adults in the EADC-ADNI dataset. For both, OPAL is compared to state-of-the-art methods. Results show that OPAL obtained the highest median Dice coefficient (89.9% for ICBM and 90.1% for EADC-ADNI). Moreover, in both cases, OPAL produced a segmentation accuracy similar to inter-expert variability. On the EADC-ADNI dataset, we compare the hippocampal volumes obtained by manual and automatic segmentation. The volumes appear to be highly correlated, which enables a more accurate separation of pathological populations.
[223] Robust superpixels using color and contour features along linear path
Rémi Giraud, Vinh-Thong Ta, Nicolas Papadakis
Main category: cs.CV
TL;DR: SCALP proposes a superpixel decomposition method that jointly enforces color homogeneity, contour adherence, and shape regularity by considering color features along linear paths between pixels and superpixel barycenters, with contour priors and neighborhood integration.
Details
Motivation: Existing superpixel methods face trade-offs between color homogeneity, contour adherence, and shape regularity. The authors aim to develop a framework that simultaneously optimizes all three aspects to produce more accurate and regular superpixels.
Method: SCALP considers color features along linear paths between pixels and superpixel barycenters, uses contour priors to prevent boundary crossing, and integrates pixel neighborhood information while maintaining computational efficiency.
Result: Extensive evaluation on standard segmentation datasets shows SCALP outperforms state-of-the-art methods. The method is also successfully extended to supervoxel decomposition on MRI images.
Conclusion: SCALP provides an effective framework for superpixel decomposition that achieves superior performance by jointly optimizing all key aspects (homogeneity, contour adherence, regularity) and demonstrates robustness through successful extension to 3D supervoxel applications.
Abstract: Superpixel decomposition methods are widely used in computer vision and image processing applications. By grouping homogeneous pixels, accuracy can be increased, and the reduced number of elements to process can drastically lower the computational burden. For most superpixel methods, a trade-off is computed between 1) color homogeneity, 2) adherence to the image contours, and 3) shape regularity of the decomposition. In this paper, we propose a framework that jointly enforces all these aspects and provides accurate and regular Superpixels with Contour Adherence using Linear Path (SCALP). During the decomposition, we propose to consider color features along the linear path between the pixel and the corresponding superpixel barycenter. A contour prior is also used to prevent the crossing of image boundaries when associating a pixel to a superpixel. Finally, in order to improve the decomposition accuracy and the robustness to noise, we propose to integrate the pixel neighborhood information, while preserving the same computational complexity. SCALP is extensively evaluated on standard segmentation datasets, and the obtained results outperform those of state-of-the-art methods. SCALP is also extended for supervoxel decomposition on MRI images.
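The linear-path feature admits a compact illustration: sample colors along the segment from a pixel to the superpixel barycenter and compare them to the superpixel's mean color. The NumPy sketch below is a simplified reading of that idea; the sampling density and exact cost weighting are assumptions.

```python
import numpy as np

def linear_path_color_distance(image, pixel, barycenter, mean_color, n=10):
    """Illustrative sketch of SCALP's linear-path feature: compare a
    superpixel's mean color to colors sampled along the segment joining a
    candidate pixel to the superpixel barycenter, so paths that cross a
    contour incur a high cost. The exact weighting in the paper may differ."""
    (y0, x0), (y1, x1) = pixel, barycenter
    ts = np.linspace(0.0, 1.0, n)
    ys = np.clip(np.round(y0 + ts * (y1 - y0)).astype(int), 0, image.shape[0] - 1)
    xs = np.clip(np.round(x0 + ts * (x1 - x0)).astype(int), 0, image.shape[1] - 1)
    colors = image[ys, xs].astype(float)              # (n, 3) sampled colors
    return np.linalg.norm(colors - mean_color, axis=1).mean()
```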
[224] Texture Superpixel Clustering from Patch-based Nearest Neighbor Matching
Rémi Giraud, Yannick Berthoumieu
Main category: cs.CV
TL;DR: Proposes NNSC, a texture-aware superpixel clustering method using patch-based nearest neighbor matching that outperforms existing methods in segmentation performance and computational efficiency.
Details
Motivation: Existing superpixel decomposition methods often fail to efficiently cluster image pixels according to local texture, creating a need for more effective texture-aware approaches.
Method: Introduces a new clustering framework using patch-based nearest neighbor matching instead of traditional pixel-wise K-means clustering, directly grouping pixels in patch space to capture texture information.
Result: Demonstrates favorable segmentation performance on standard color and texture datasets, and shows superior computational efficiency compared to recent texture-aware superpixel methods.
Conclusion: NNSC provides an effective texture-aware superpixel clustering solution that balances segmentation quality with computational efficiency.
Abstract: Superpixels are widely used in computer vision applications. Nevertheless, decomposition methods may still fail to efficiently cluster image pixels according to their local texture. In this paper, we propose a new Nearest Neighbor-based Superpixel Clustering (NNSC) method to generate texture-aware superpixels in a limited computational time compared to previous approaches. We introduce a new clustering framework using patch-based nearest neighbor matching, while most existing methods are based on a pixel-wise K-means clustering. Therefore, we directly group pixels in the patch space, enabling the capture of texture information. We demonstrate the efficiency of our method with favorable comparisons in terms of segmentation performance on both standard color and texture datasets. We also show the computational efficiency of NNSC compared to recent texture-aware superpixel methods.
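The contrast with pixel-wise K-means can be shown in a few lines: cluster assignment happens in patch space, so the comparison carries texture, not just a single color. This brute-force sketch is for clarity only; the paper relies on fast nearest-neighbor matching, and the `centers` representation is an assumption.

```python
import numpy as np

def patch_space_assignment(image, centers, patch=5):
    """Sketch of the NNSC idea: assign each pixel by matching its
    surrounding *patch* (texture) to cluster representatives, rather than
    comparing single pixel colors as in K-means-style methods."""
    H, W = image.shape[:2]
    r = patch // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode='reflect')
    labels = np.zeros((H, W), dtype=int)
    for y in range(H):
        for x in range(W):
            p = padded[y:y + patch, x:x + patch].ravel()
            # centers: (K, patch*patch*3) representative texture patches
            labels[y, x] = np.argmin(((centers - p) ** 2).sum(axis=1))
    return labels
```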
[225] Multi-Scale Superpatch Matching using Dual Superpixel Descriptors
Rémi Giraud, Merlin Boyer, Michaël Clément
Main category: cs.CV
TL;DR: The paper introduces a novel superpixel neighborhood descriptor called dual superpatch that captures both region features and contour information at superpixel borders to improve pattern matching accuracy.
Details
Motivation: Existing superpixel neighborhood descriptors are sub-optimal because they only compute features within each region, poorly capturing contour information at superpixel borders, which limits their ability to provide robust and accurate descriptors for similar pattern matching.
Method: The proposed dual superpatch structure contains features computed in reduced superpixel regions as well as at the interfaces of multiple superpixels to explicitly capture contour structure information. A fast multi-scale non-local matching framework is also introduced for searching similar descriptors at different resolution levels.
Result: The dual superpatch enables more accurate capture of similar structured patterns at different scales, demonstrating robustness and performance improvements in matching and supervised labeling applications.
Conclusion: The dual superpatch descriptor effectively addresses the limitations of existing superpixel neighborhood descriptors by incorporating both region and contour information, leading to improved pattern matching accuracy and robustness in image processing applications.
Abstract: Over-segmentation into superpixels is a very effective dimensionality reduction strategy, enabling fast dense image processing. The main issue of this approach is the inherent irregularity of the image decomposition compared to standard hierarchical multi-resolution schemes, especially when searching for similar neighboring patterns. Several works have attempted to overcome this issue by taking the region irregularity into account in their comparison model. Nevertheless, they remain sub-optimal for providing robust and accurate superpixel neighborhood descriptors, since they only compute features within each region, poorly capturing contour information at superpixel borders. In this work, we address these limitations by introducing the dual superpatch, a novel superpixel neighborhood descriptor. This structure contains features computed in reduced superpixel regions, as well as at the interfaces of multiple superpixels to explicitly capture contour structure information. A fast multi-scale non-local matching framework is also introduced for the search of similar descriptors at different resolution levels in an image dataset. The proposed dual superpatch enables more accurate capture of similarly structured patterns at different scales, and we demonstrate the robustness and performance of this new strategy on matching and supervised labeling applications.
[226] Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints
Chuan Fang, Yuan Dong, Kunming Luo, Xiaotao Hu, Rakesh Shrestha, Ping Tan
Main category: cs.CV
TL;DR: Ctrl-Room is a text-driven 3D indoor scene generation method that separates layout and appearance modeling, enabling high-quality room generation with flexible editing capabilities.
Details
Motivation: Existing methods cannot faithfully capture room layouts or allow flexible editing of individual objects. Ctrl-Room addresses these limitations to generate convincing 3D rooms with designer-style layouts from text prompts.
Method: A two-stage approach: 1) a Layout Generation Stage using a text-conditional diffusion model with holistic scene code parameterization, and 2) an Appearance Generation Stage using a fine-tuned ControlNet to produce panoramic images guided by the 3D layout and text.
Result: Outperforms existing methods on Structured3D dataset, producing more reasonable, view-consistent, and editable 3D rooms. Enables easy editing through mask-guided editing module without expensive edit-specific training.
Conclusion: Ctrl-Room achieves high-quality 3D room generation with convincing layouts and lively textures, enabling versatile interactive editing operations for individual furniture items.
Abstract: Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. We thus achieve a high-quality 3D room generation with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive edit-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.
[227] Long Video Understanding with Learnable Retrieval in Video-Language Models
Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
Main category: cs.CV
TL;DR: R-VLM is a learnable retrieval-based video-language model that efficiently handles long video understanding by selecting relevant video chunks to reduce token count and eliminate noise.
Details
Motivation: Large language models face challenges with long videos due to high computational costs from excessive video tokens, loss of visual details from token aggregation, and noise from irrelevant tokens.
Method: The model uses a learnable lightweight MLP block to retrieve the most relevant K video chunks based on a question query, then uses their visual tokens as context for LLM inference with end-to-end training and soft matching loss.
Result: Experimental results on multiple zero-shot video question answering datasets validate the framework’s effectiveness for long video comprehension.
Conclusion: R-VLM provides an efficient solution for long video understanding by reducing token count, eliminating noise, and enhancing performance through selective chunk retrieval.
Abstract: The remarkable natural language understanding, reasoning, and generation capabilities of large language models (LLMs) have made them attractive for application to video understanding, utilizing video tokens as contextual input. However, employing LLMs for long video understanding presents significant challenges. The extensive number of video tokens leads to considerable computational costs for LLMs while using aggregated tokens results in loss of vision details. Moreover, the presence of abundant question-irrelevant tokens introduces noise to the video reasoning process. To address these issues, we introduce a simple yet effective learnable retrieval-based video-language model (R-VLM) for efficient long video understanding. Specifically, given a question (query) and a long video, our model identifies and selects the most relevant K video chunks and uses their associated visual tokens to serve as context for the LLM inference. This effectively reduces the number of video tokens, eliminates noise interference, and enhances system performance. We achieve this by incorporating a learnable lightweight MLP block to facilitate the efficient retrieval of question-relevant chunks, through the end-to-end training of our video-language model with a proposed soft matching loss. Our experimental results on multiple zero-shot video question answering datasets validate the effectiveness of our framework for comprehending long videos.
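The retrieval step is simple enough to sketch: a small MLP scores pooled chunk embeddings against the question embedding and keeps the top-K. In the PyTorch sketch below, the scoring-head shape, pooling, and value of K are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkRetriever(nn.Module):
    """Minimal sketch of R-VLM-style retrieval: a lightweight MLP scores
    each video chunk against the question embedding, and the top-K chunks'
    visual tokens become the LLM context."""

    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, chunk_emb, question_emb, k=4):
        # chunk_emb: (N, dim), one pooled embedding per chunk
        # question_emb: (dim,)
        q = question_emb.unsqueeze(0).expand(chunk_emb.size(0), -1)
        logits = self.score(torch.cat([chunk_emb, q], dim=-1)).squeeze(-1)
        top = torch.topk(logits, k)
        # Soft scores could feed a soft matching loss during training.
        return top.indices, F.softmax(logits, dim=0)
```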
[228] CLIP Can Understand Depth
Sohee Kim, Jisu Kang, Dunam Kim, Seokju Lee
Main category: cs.CV
TL;DR: This paper introduces ‘mirror’, a learnable embedding matrix that distills CLIP’s semantic prior to enable monocular depth estimation without fine-tuning, achieving state-of-the-art performance while being parameter-efficient.
Details
Motivation: CLIP's vision-language alignment struggles with depth estimation tasks where it fails to capture similarities between image patches and natural language prompts describing distance, despite its success in other domains.
Method: Eliminate CLIP’s pre-trained natural language token embeddings and distill the semantic prior into a single learnable ‘mirror’ embedding matrix. Jointly train mirror and a compact decoder on frozen CLIP for dense depth prediction.
Result: The model matches state-of-the-art vision models on NYU Depth v2 and KITTI benchmarks while outperforming all vision-language depth models based on frozen CLIP, with significantly better parameter and computational efficiency.
Conclusion: CLIP’s suboptimal depth understanding can be corrected without fine-tuning by using mirror embeddings, which implicitly learn to capture semantic cues important for depth estimation.
Abstract: In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called “mirror”. The main design goal of mirror is to derive a non-human language prompt that approximates an optimal natural language prompt: “How far is this location from the camera?” Using this approach, we jointly train two lightweight modules, a mirror and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating mirror with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of mirror shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.
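A rough sketch of the mirror setup, under explicit assumptions: the injected `frozen_text_encoder` (taken to map a token-embedding sequence to one pooled feature) and the one-layer decoder are hypothetical stand-ins, and the real model's decoder and feature plumbing surely differ.

```python
import torch
import torch.nn as nn

class MirrorDepthHead(nn.Module):
    """Sketch of the 'mirror' idea: a learnable token-embedding matrix
    replaces the natural-language prompt, is pushed through a frozen text
    encoder, and the resulting feature is scored against dense image-patch
    features for depth prediction."""

    def __init__(self, frozen_text_encoder, n_tokens=8, dim=512):
        super().__init__()
        self.text_encoder = frozen_text_encoder.eval()
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # The learnable non-language prompt ("mirror"), trained end to end.
        self.mirror = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        self.decoder = nn.Conv2d(1, 1, 3, padding=1)  # stand-in compact decoder

    def forward(self, patch_feats):
        # patch_feats: (B, dim, H, W) from the frozen CLIP image encoder
        text_feat = self.text_encoder(self.mirror).squeeze(0)   # (dim,)
        sim = torch.einsum('bdhw,d->bhw', patch_feats, text_feat)
        return self.decoder(sim.unsqueeze(1))                   # depth logits
```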
[229] MCPDepth: Omnidirectional Depth Estimation via Stereo Matching from Multi-Cylindrical Panoramas
Feng Qiao, Zhexiao Xiong, Xinge Zhu, Yuexin Ma, Qiumeng He, Nathan Jacobs
Main category: cs.CV
TL;DR: MCPDepth is a novel two-stage framework for omnidirectional depth estimation that uses stereo matching across multiple cylindrical panoramas, achieving significant performance improvements over existing methods.
Details
Motivation: Omnidirectional depth estimation is challenging due to distortions in panoramic images, and the impact of projection methods remains underexplored. Existing methods rely on customized kernels to handle distortions, which limits deployment on embedded devices.
Method: A two-stage framework: 1) stereo matching using cylindrical panoramas, 2) robust fusion of depth maps from different views. Uses standard network components with a circular attention module to address vertical distortions and expand the receptive field beyond traditional convolutions.
Result: Improves MAE by 18.8% on Deep360 outdoor dataset and 19.9% on 3D60 real dataset. Comprehensive analysis shows cylindrical projection is superior to spherical and cubic projections.
Conclusion: MCPDepth establishes a new paradigm in omnidirectional depth estimation, offering practical insights for real-world applications and seamless deployment on embedded devices. The method demonstrates superior efficacy of cylindrical projection.
Abstract: Omnidirectional depth estimation presents a significant challenge due to the inherent distortions in panoramic images. Despite notable advancements, the impact of projection methods remains underexplored. We introduce Multi-Cylindrical Panoramic Depth Estimation (MCPDepth), a novel two-stage framework designed to enhance omnidirectional depth estimation through stereo matching across multiple cylindrical panoramas. MCPDepth initially performs stereo matching using cylindrical panoramas, followed by a robust fusion of the resulting depth maps from different views. Unlike existing methods that rely on customized kernels to address distortions, MCPDepth utilizes standard network components, facilitating seamless deployment on embedded devices while delivering exceptional performance. To effectively address vertical distortions in cylindrical panoramas, MCPDepth incorporates a circular attention module, significantly expanding the receptive field beyond traditional convolutions. We provide a comprehensive theoretical and experimental analysis of common panoramic projections (spherical, cylindrical, and cubic), demonstrating the superior efficacy of cylindrical projection. Our method improves the mean absolute error (MAE) by 18.8% on the outdoor dataset Deep360 and by 19.9% on the real dataset 3D60. This work offers practical insights for other tasks and real-world applications, establishing a new paradigm in omnidirectional depth estimation. The code is available at https://github.com/Qjizhi/MCPDepth.
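The receptive-field argument can be made concrete with axis attention: attending along a full image axis covers every position at once, which a fixed convolution kernel cannot. The sketch below captures that intuition only; MCPDepth's actual circular attention module may be structured differently.

```python
import torch
import torch.nn as nn

class VerticalAxisAttention(nn.Module):
    """Sketch of the circular-attention intuition: self-attention along an
    image axis gives every position a full-axis receptive field, which
    helps with the vertical distortions of cylindrical panoramas."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W); attend along the vertical axis of each column.
        B, C, H, W = x.shape
        seq = x.permute(0, 3, 2, 1).reshape(B * W, H, C)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(B, W, H, C).permute(0, 3, 2, 1) + x  # residual
```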
[230] Positional Prompt Tuning for Efficient 3D Representation Learning
Shaochen Zhang, Zekun Qi, Runpei Dong, Xiuxiu Bai, Xing Wei
Main category: cs.CV
TL;DR: PPT is a parameter-efficient fine-tuning method for point cloud analysis that uses increased patch tokens and trainable positional encoding while keeping most pre-trained model parameters frozen, achieving state-of-the-art results with only 1.05M trainable parameters.
Details
Motivation: To rethink the role of positional encoding in 3D representation learning and explore parameter-efficient fine-tuning through prompts and adapters for point cloud analysis.
Method: PPT incorporates increased patch tokens and trainable positional encoding while freezing most pre-trained model parameters. It uses positional encoding to aggregate multi-scale features of point clouds.
Result: Achieves state-of-the-art results including 95.01% accuracy on ScanObjectNN OBJ_BG dataset with only 1.05M parameters for training.
Conclusion: PPT is an effective and efficient parameter-efficient fine-tuning method for point cloud analysis that demonstrates superior performance while requiring minimal trainable parameters.
Abstract: We rethink the role of positional encoding in 3D representation learning and fine-tuning. We argue that using positional encoding in point Transformer-based methods serves to aggregate multi-scale features of point clouds. Additionally, we explore parameter-efficient fine-tuning (PEFT) through the lens of prompts and adapters, introducing a straightforward yet effective method called PPT for point cloud analysis. PPT incorporates increased patch tokens and trainable positional encoding while keeping most pre-trained model parameters frozen. Extensive experiments validate that PPT is both effective and efficient. Our proposed PEFT method, PPT, with only 1.05M trainable parameters, achieves state-of-the-art results on several mainstream datasets, such as 95.01% accuracy on the ScanObjectNN OBJ_BG dataset. Codes and weights will be released at https://github.com/zsc000722/PPT.
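The "freeze everything, train prompts and positions" recipe is short enough to sketch. In the PyTorch sketch below, the token counts and dimensions are illustrative assumptions; only `prompts` and `pos_enc` receive gradients, which is what keeps the trainable-parameter budget tiny.

```python
import torch
import torch.nn as nn

class PositionalPromptTuning(nn.Module):
    """Minimal PEFT sketch in the spirit of PPT: freeze the pre-trained
    point Transformer and train only added prompt tokens plus a positional
    encoding."""

    def __init__(self, frozen_backbone, n_prompts=16, n_patches=128, dim=384):
        super().__init__()
        self.backbone = frozen_backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, dim))
        self.pos_enc = nn.Parameter(torch.zeros(1, n_prompts + n_patches, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        nn.init.trunc_normal_(self.pos_enc, std=0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (B, n_patches, dim) from the frozen patch embedder
        x = torch.cat([self.prompts.expand(patch_tokens.size(0), -1, -1),
                       patch_tokens], dim=1)
        return self.backbone(x + self.pos_enc)  # only prompts/pos_enc train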
[231] Lagrangian Motion Fields for Long-term Motion Generation
Yifei Yang, Zikai Huang, Chenshu Xu, Shengfeng He
Main category: cs.CV
TL;DR: The paper introduces Lagrangian Motion Fields, a novel approach for long-term motion generation that treats joints as Lagrangian particles with uniform velocity, creating condensed ‘supermotions’ to overcome limitations of framewise representations.
Details
Motivation: Current motion generation methods rely on framewise representations that capture only static spatial details and overlook temporal dynamics, leading to redundancy and difficulty in generating effective long-term motion sequences.
Method: Proposes Lagrangian Motion Fields where each joint is treated as a Lagrangian particle with uniform velocity over short intervals, condensing motion into ‘supermotions’ that integrate spatial information with temporal dynamics without requiring neural network preprocessing.
Result: The approach excels in long-term music-to-dance and text-to-motion generation, offering enhanced efficiency, superior generation quality, and greater diversity compared to existing methods. It also enables applications like infinite motion looping and fine-grained controlled motion generation.
Conclusion: Lagrangian Motion Fields provide a versatile and lightweight solution for long-term motion generation that transcends limitations of existing architectures and motion content types, demonstrating broad utility across various applications.
Abstract: Long-term motion generation is a challenging task that requires producing coherent and realistic sequences over extended durations. Current methods primarily rely on framewise motion representations, which capture only static spatial details and overlook temporal dynamics. This approach leads to significant redundancy across the temporal dimension, complicating the generation of effective long-term motion. To overcome these limitations, we introduce the novel concept of Lagrangian Motion Fields, specifically designed for long-term motion generation. By treating each joint as a Lagrangian particle with uniform velocity over short intervals, our approach condenses motion representations into a series of “supermotions” (analogous to superpixels). This method seamlessly integrates static spatial information with interpretable temporal dynamics, transcending the limitations of existing network architectures and motion sequence content types. Our solution is versatile and lightweight, eliminating the need for neural network preprocessing. Our approach excels in tasks such as long-term music-to-dance generation and text-to-motion generation, offering enhanced efficiency, superior generation quality, and greater diversity compared to existing methods. Additionally, the adaptability of Lagrangian Motion Fields extends to applications like infinite motion looping and fine-grained controlled motion generation, highlighting its broad utility. Video demonstrations are available at https://plyfager.github.io/LaMoG.
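The supermotion encoding is essentially piecewise-constant velocity compression, which the NumPy sketch below illustrates; the interval length and the exact keyframe-plus-velocity encoding are assumptions inferred from the summary.

```python
import numpy as np

def to_supermotions(traj, interval=8):
    """Sketch of the Lagrangian-particle idea: approximate each joint's
    path by a constant velocity over short intervals, storing keyframe
    poses plus per-interval velocities ("supermotions")."""
    # traj: (T, J, 3) joint positions; T assumed a multiple of `interval`.
    keys = traj[::interval]                               # keyframe poses
    nxt = np.concatenate([traj[interval::interval], traj[-1:]], axis=0)
    vels = (nxt - keys) / interval                        # mean velocities
    return keys, vels

def reconstruct(keys, vels, interval=8):
    """Decode supermotions back to per-frame poses (piecewise linear)."""
    frames = [keys[i] + t * vels[i]
              for i in range(len(keys)) for t in range(interval)]
    return np.stack(frames)
```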
[232] Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion
Yijun Liang, Shweta Bhardwaj, Tianyi Zhou
Main category: cs.CV
TL;DR: DisCL is a novel Diffusion Curriculum framework that uses image-guided diffusion models to generate synthetic data at different guidance levels, enabling adaptive training that focuses on hard samples to improve performance on long-tail classification and low-quality data learning tasks.
Details
Motivation: Low-quality or scarce data poses challenges for training deep neural networks. Text-only guidance in diffusion models cannot control synthetic images’ proximity to original images, leading to out-of-distribution data that harms model performance.
Method: DisCL uses image guidance to create a spectrum of interpolations between synthetic and real images. It adjusts image guidance levels during training, focusing on hard samples and assessing optimal guidance levels for synthetic images to improve learning of difficult data.
Result: DisCL achieved 2.7% and 2.1% gains in OOD and ID macro-accuracy on iWildCam dataset. On ImageNet-LT, it improved tail-class accuracy from 4.4% to 23.64% and all-class accuracy by 4.02%.
Conclusion: The Diffusion Curriculum framework effectively addresses data scarcity and quality issues by adaptively generating and utilizing synthetic data at optimal guidance levels, significantly improving model performance on challenging tasks.
Abstract: Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute genuinely new data, diffusion models open up a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images’ proximity to the original images, resulting in out-of-distribution data detrimental to the model performance. To overcome this limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn; with weaker image guidance, the synthetic images are easier for the model but contribute to a larger distribution gap with the original data. The generated full spectrum of data enables us to build a novel “Diffusion Curriculum (DisCL)”. DisCL adjusts the image guidance level of image synthesis for each training stage: it identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on lower-guidance images of high quality to learn prototypical features as a warm-up for learning higher-guidance images that might be weak in diversity or quality. Extensive experiments showcase a gain of 2.7% and 2.1% in OOD and ID macro-accuracy when applying DisCL to the iWildCam dataset. On ImageNet-LT, DisCL improves the base model’s tail-class accuracy from 4.4% to 23.64% and leads to a 4.02% improvement in all-class accuracy.
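The curriculum itself reduces to a schedule over guidance strength, as in this small sketch. The linear ramp and its bounds are illustrative assumptions; the paper assesses the most effective guidance level adaptively rather than following a fixed schedule.

```python
def guidance_for_stage(stage, n_stages, g_min=0.2, g_max=0.8):
    """Sketch of DisCL's curriculum intuition: warm up on low-guidance
    synthetic images (easy, prototypical), then move toward high-guidance
    images that stay close to the hard real samples."""
    t = stage / max(n_stages - 1, 1)
    return g_min + t * (g_max - g_min)

# Usage: pick the image-guidance strength per training stage, then train
# on synthetic images generated at that level for the hard samples.
for stage in range(4):
    print(f"stage {stage}: guidance ~ {guidance_for_stage(stage, 4):.2f}")
```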
[233] Replay-Free Continual Low-Rank Adaptation with Dynamic Memory
Huancheng Chen, Jingtao Li, Weiming Zhuang, Chen Chen, Lingjuan Lyu
Main category: cs.CV
TL;DR: DualLoRA: A novel parameter-efficient fine-tuning method for continual learning that uses dual low-rank adapters with dynamic memory to balance stability and plasticity in vision transformers.
Details
Motivation: Address catastrophic forgetting in continual learning for large-scale vision transformers by bridging the gap between parameter-efficient fine-tuning (PEFT) and continual learning, as LoRA techniques have been under-explored in CL contexts.
Method: Proposes Dual Low-Rank Adaptation (DualLoRA) with orthogonal and residual LoRA adapters parallel to pre-trained weights, orchestrated by a dynamic memory mechanism. Includes task identity prediction with confidence calibration.
Result: DualLoRA achieves significant advantages in accuracy, inference speed, and computation efficiency over existing CL methods across multiple benchmarks on ViT-based models.
Conclusion: The proposed DualLoRA method effectively addresses catastrophic forgetting in continual learning for vision transformers through efficient parameter adaptation and dynamic memory management.
Abstract: We revisit continual learning (CL), which enables pre-trained vision transformers (ViTs) to sequentially fine-tune on new downstream tasks over time. However, as the scale of these models increases, catastrophic forgetting remains a more serious challenge. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning (PEFT), which focuses on fine-tuning only a small set of trainable parameters to adapt to downstream tasks, such as low-rank adaptation (LoRA). While LoRA achieves faster convergence and requires fewer trainable parameters, it has seldom been explored in the context of continual learning. To address this gap, we propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA), which introduces both an orthogonal LoRA adapter and a residual LoRA adapter parallel to pre-trained weights in each layer. These components are orchestrated by a dynamic memory mechanism to strike a balance between stability and plasticity. Additionally, we propose a scheme to predict task identity with confidence and calibrate the model’s outputs accordingly. On ViT-based models, we demonstrate that DualLoRA offers significant advantages in accuracy, inference speed, and computation efficiency in training over existing CL methods across multiple benchmarks.
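The layer layout is easy to sketch: a frozen linear weight with two parallel low-rank branches. In this sketch the scalar `gate` stands in for the dynamic memory mechanism, and no orthogonality constraint is enforced; both are simplifications of the paper's design.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Sketch of the DualLoRA layout: a frozen pre-trained linear layer
    with two parallel low-rank adapters, one aimed at stability and one
    at plasticity, mixed by a memory-derived gate."""

    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A_orth = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_orth = nn.Parameter(torch.zeros(d_out, rank))
        self.A_res = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_res = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x, gate=0.5):
        # gate would come from the dynamic memory in the full method.
        delta = (x @ self.A_orth.T) @ self.B_orth.T \
              + gate * (x @ self.A_res.T) @ self.B_res.T
        return self.base(x) + delta
```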
[234] SMLNet: A SPD Manifold Learning Network for Infrared and Visible Image Fusion
Huan Kang, Hui Li, Tianyang Xu, Xiao-Jun Wu, Rui Wang, Chunyang Cheng, Josef Kittler
Main category: cs.CV
TL;DR: A novel SPD manifold learning method called SMLNet is proposed for multi-modal image fusion, extending fusion from Euclidean space to SPD manifolds to better handle non-Euclidean data structures and align with human visual perception.
Details
Motivation: Traditional Euclidean representation learning struggles with real-world non-Euclidean data structures, and evaluating latent representation consistency using Euclidean distance is challenging for multi-modal image fusion tasks.
Method: The method encodes images using Riemannian geometry to exploit intrinsic statistical correlations, employs a cross-modal fusion strategy, develops an attention module for semantic affinity processing, and designs an end-to-end fusion network based on SPD manifold learning.
Result: Extensive experiments on public datasets demonstrate superior performance compared to state-of-the-art methods.
Conclusion: The proposed SMLNet framework effectively addresses the limitations of Euclidean methods by leveraging SPD manifold learning for multi-modal image fusion, achieving better performance through Riemannian geometry-based encoding and cross-modal semantic processing.
Abstract: Euclidean representation learning methods have achieved promising results in image fusion tasks, which can be attributed to their clear advantages in handling linear spaces. However, data collected from a realistic scene usually has a non-Euclidean structure, so evaluating the consistency of latent representations from paired views using Euclidean distance raises challenges. To address this issue, a novel SPD (symmetric positive definite) manifold learning framework is proposed for multi-modal image fusion, named SMLNet, which extends the image fusion approach from Euclidean space to SPD manifolds. Specifically, we encode images according to the Riemannian geometry to exploit their intrinsic statistical correlations, thereby aligning with human visual perception. The SPD matrix fundamentally underpins our network’s learning process. Building upon this mathematical foundation, we employ a cross-modal fusion strategy to exploit modality-specific dependencies and augment complementary information. To capture semantic similarity in images’ intrinsic space, we further develop an attention module that meticulously processes the cross-modal semantic affinity matrix. Based on this, we design an end-to-end fusion network based on cross-modal manifold learning. Extensive experiments on public datasets demonstrate that our framework exhibits superior performance compared to the current state-of-the-art methods. Our code will be publicly available at https://github.com/Shaoyun2023.
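A concrete instance of an SPD encoding is the channel covariance of a feature map, compared under a Riemannian metric. The sketch below is a simplified stand-in for SMLNet's encoding, using the standard log-Euclidean distance; it is not the paper's exact construction.

```python
import torch

def spd_descriptor(feats, eps=1e-5):
    """Summarize a feature map by its channel covariance, a symmetric
    positive definite matrix, so statistics (not raw activations) carry
    the representation."""
    C = feats.shape[0]                      # feats: (C, H, W)
    X = feats.reshape(C, -1)
    X = X - X.mean(dim=1, keepdim=True)
    cov = X @ X.T / (X.shape[1] - 1)
    return cov + eps * torch.eye(C)         # regularize to stay SPD

def log_euclidean_dist(S1, S2):
    """Distance between SPD matrices via matrix logarithms, a common
    Riemannian alternative to Euclidean distance."""
    def logm(S):
        w, V = torch.linalg.eigh(S)
        return V @ torch.diag(torch.log(w)) @ V.T
    return torch.linalg.norm(logm(S1) - logm(S2))
```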
[235] DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting
Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu
Main category: cs.CV
TL;DR: DreamMix is a diffusion-based framework for subject-driven image inpainting that addresses identity overfitting by enabling attribute modifications while preserving object identity through attribute decoupling and disentangled generation.
Details
Motivation: Current subject-driven image inpainting methods suffer from identity overfitting, where original attributes remain entangled with target textual instructions, limiting effective attribute editing.
Method: DreamMix introduces three components: Attribute Decoupling Mechanism (ADM) for diverse attribute-augmented training, Textual Attribute Substitution (TAS) for attribute isolation via orthogonal decomposition, and Disentangled Inpainting Framework (DIF) that separates local generation from global harmonization.
Result: Extensive experiments show DreamMix achieves superior balance between identity preservation and attribute editability across object insertion, attribute editing, and small object inpainting applications.
Conclusion: DreamMix effectively overcomes identity overfitting limitations in subject-driven image inpainting, enabling high-quality attribute modifications while maintaining object identity across diverse inpainting tasks.
Abstract: Subject-driven image inpainting has recently gained prominence in image editing with the rapid advancement of diffusion models. Beyond image guidance, recent studies have explored incorporating text guidance to achieve identity-preserved yet locally editable object inpainting. However, these methods still suffer from identity overfitting, where original attributes remain entangled with target textual instructions. To overcome this limitation, we propose DreamMix, a diffusion-based framework adept at inserting target objects into user-specified regions while concurrently enabling arbitrary text-driven attribute modifications. DreamMix introduces three key components: (i) an Attribute Decoupling Mechanism (ADM) that synthesizes diverse attribute-augmented image-text pairs to mitigate overfitting; (ii) a Textual Attribute Substitution (TAS) module that isolates target attributes via orthogonal decomposition; and (iii) a Disentangled Inpainting Framework (DIF) that separates local generation from global harmonization. Extensive experiments across multiple inpainting backbones demonstrate that DreamMix achieves a superior balance between identity preservation and attribute editability across diverse applications, including object insertion, attribute editing, and small object inpainting.
[236] SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection
Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Felix Fent, Gerhard Rigoll
Main category: cs.CV
TL;DR: SpaRC is a sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features, achieving state-of-the-art performance in autonomous driving perception.
Details
Motivation: Current query-based transformers for camera-only detection suffer from false positive detections and poor localization precision due to implicit depth modeling. Conventional dense BEV-based approaches are computationally intensive.
Method: Uses three key techniques: sparse frustum fusion (SFF) for cross-modal feature alignment, range-adaptive radar aggregation (RAR) for precise object localization, and local self-attention (LSA) for focused query aggregation. Operates directly on encoded point features instead of dense BEV-grid rendering.
Result: Achieves state-of-the-art performance with 67.1 NDS and 63.1 AMOTA on nuScenes and TruckScenes benchmarks, significantly outperforming existing dense BEV-based and sparse query-based detectors.
Conclusion: SpaRC provides substantial improvements in both efficiency and accuracy for 3D perception in autonomous driving by effectively fusing radar and camera modalities through sparse fusion techniques.
Abstract: In this work, we present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird’s Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance metrics of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at https://github.com/phi-wol/sparc.
[237] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang
Main category: cs.CV
TL;DR: The paper presents a method to convert bidirectional video diffusion models into autoregressive transformers for real-time streaming generation, using distillation techniques to reduce latency while maintaining quality.
Details
Motivation: Current video diffusion models struggle with interactive applications due to bidirectional attention dependencies that require processing entire sequences including future frames, causing latency issues.
Method: Adapts pretrained bidirectional diffusion transformers to autoregressive transformers and extends distribution matching distillation (DMD) to videos, with student initialization from the teacher’s ODE trajectories and an asymmetric distillation strategy.
Result: Achieves 84.27 on VBench-Long benchmark (surpassing previous models), enables 9.4 FPS streaming generation on single GPU with KV caching, and supports zero-shot streaming video-to-video translation, image-to-video, and dynamic prompting.
Conclusion: The approach effectively addresses latency limitations of video diffusion models while maintaining generation quality, enabling practical interactive video generation applications.
Abstract: Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher’s ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models. It enables fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner.
[238] SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Junkai Chen, Zhijie Deng, Kening Zheng, Yibo Yan, Shuliang Liu, PeiJun Wu, Peijie Jiang, Jia Liu, Xuming Hu
Main category: cs.CV
TL;DR: SAFEERASER is a safety unlearning benchmark for Multimodal Large Language Models (MLLMs) that addresses security issues by enabling selective forgetting of harmful content while maintaining model performance.
Details
Motivation: As MLLMs develop, their security vulnerabilities become more prominent. Machine Unlearning (MU) has been used for privacy protection but hasn’t been fully explored for safety purposes in MLLMs, creating a research gap.
Method: The authors propose the SAFEERASER benchmark with 3,000 images and 28.8K VQA pairs, and introduce a Prompt Decouple (PD) Loss to alleviate over-forgetting during unlearning. They also propose the Safe Answer Refusal Rate (SARR) metric to measure over-forgetting.
Result: Existing MU methods struggle with maintaining performance and suffer from over-forgetting. PD Loss combined with existing methods reduces SARR metric by 79.5% in LLaVA-7B and LLaVA-13B while maintaining forget quality and model utility.
Conclusion: The proposed PD Loss effectively prevents over-forgetting in safety unlearning for MLLMs, and SAFEERASER provides a comprehensive benchmark for evaluating safety unlearning methods.
Abstract: As Multimodal Large Language Models (MLLMs) develop, their potential security issues have become increasingly prominent. Machine Unlearning (MU), as an effective strategy for forgetting specific knowledge in training data, has been widely used in privacy protection. However, MU for safety in MLLM has yet to be fully explored. To address this issue, we propose SAFEERASER, a safety unlearning benchmark for MLLMs, consisting of 3,000 images and 28.8K VQA pairs. We comprehensively evaluate unlearning methods from two perspectives: forget quality and model utility. Our findings show that existing MU methods struggle to maintain model performance while implementing the forget operation and often suffer from over-forgetting. Hence, we introduce Prompt Decouple (PD) Loss to alleviate over-forgetting through decouple prompt during unlearning process. To quantitatively measure over-forgetting mitigated by PD Loss, we propose a new metric called Safe Answer Refusal Rate (SARR). Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting and achieve a decrease of 79.5% in the SARR metric of LLaVA-7B and LLaVA-13B, while maintaining forget quality and model utility. Our code and dataset will be released upon acceptance. Warning: This paper contains examples of harmful language and images, and reader discretion is recommended.
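The SARR metric itself is straightforward to operationalize, as in the sketch below: it is the fraction of safe queries the unlearned model wrongly refuses. The marker-string detection is an illustrative simplification; the paper's exact refusal criterion may differ.

```python
def safe_answer_refusal_rate(responses,
                             refusal_markers=("i cannot", "i can't", "sorry")):
    """Sketch of the SARR idea: on *safe* queries, count how often the
    unlearned model wrongly refuses to answer, signaling over-forgetting."""
    refusals = sum(any(m in r.lower() for m in refusal_markers)
                   for r in responses)
    return refusals / max(len(responses), 1)

# Lower is better: PD Loss aims to keep SARR low after unlearning.
print(safe_answer_refusal_rate(["Sorry, I cannot help with that.",
                                "The image shows a red car."]))  # 0.5
```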
[239] Robust Computer-Vision based Construction Site Detection for Assistive-Technology Applications
Junchi Feng, Giles Hamilton-Fletcher, Nikhil Ballem, Michael Batavia, Yifei Wang, Jiuling Zhong, Maurizio Porfiri, John-Ross Rizzo
Main category: cs.CV
TL;DR: A computer vision system that detects construction hazards to assist blind and visually impaired individuals in urban navigation by identifying construction elements, scaffolding, and signage with high accuracy.
Details
Motivation: Construction zones pose significant safety risks for blind/visually impaired people due to temporary obstacles that standard navigation tools overlook, and existing hazard detection systems struggle with the visual variability of construction sites.
Method: Integrates three computer vision modules: an open-vocabulary object detector for construction elements, a YOLO-based model for scaffolding/poles, and an OCR module for construction signage. Tested both statically at sites and dynamically via first-person walking videos.
Result: 88.56% overall accuracy in static testing, 87.26% accuracy in dynamic testing (rising to 92.0% with filtering). Reliable detection within 2-10 meters and approach angles up to 75°, with perfect detection at 2-4 meters.
Conclusion: The system provides reliable real-time construction site detection at sufficient distances to enable advance warnings, allowing visually impaired individuals to make safer mobility decisions like proceeding cautiously or rerouting.
Abstract: Purpose: Navigating urban environments poses significant challenges for individuals who are blind or have low vision, especially in areas affected by construction. Construction zones introduce hazards such as uneven surfaces, barriers, hazardous materials, excessive noise, and altered routes that obstruct familiar paths and compromise safety. Although navigation tools assist in trip planning, they often overlook these temporary obstacles. Existing hazard detection systems also struggle with the visual variability of construction sites. Methods: We developed a computer vision–based assistive system integrating three modules: an open-vocabulary object detector to identify diverse construction-related elements, a YOLO-based model specialized in detecting scaffolding and poles, and an optical character recognition module to interpret construction signage. Results: In static testing at seven construction sites using images from multiple stationary viewpoints, the system achieved 88.56% overall accuracy. It consistently identified relevant objects within 2–10 meters and at approach angles up to 75°. At 2–4 meters, detection was perfect (100%) across all angles. Even at 10 meters, six of seven sites remained detectable within a 15° approach. In dynamic testing along a 0.5-mile urban route containing eight construction sites, the system analyzed every frame of a first-person walking video. It achieved 87.26% accuracy in distinguishing construction from non-construction areas, rising to 92.0% with a 50-frame majority vote filter. Conclusion: The system can reliably detect construction sites in real time and at sufficient distances to provide advance warnings, enabling individuals with visual impairments to make safer mobility decisions such as proceeding with caution or rerouting.
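The 50-frame majority-vote filter that lifts dynamic accuracy from 87.26% to 92.0% is a simple temporal smoother; a minimal sketch follows. The per-frame binary labels are the only assumed interface.

```python
from collections import deque

def majority_vote_filter(frame_labels, window=50):
    """Sketch of the 50-frame majority-vote smoothing from the dynamic
    test: each frame's construction/non-construction decision is replaced
    by the majority over the trailing window, suppressing single-frame
    flicker at the cost of a short response delay."""
    buf, smoothed = deque(maxlen=window), []
    for label in frame_labels:              # 1 = construction, 0 = not
        buf.append(label)
        smoothed.append(1 if 2 * sum(buf) > len(buf) else 0)
    return smoothed
```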
[240] LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding
Shen Zhang, Siyuan Liang, Yaning Tan, Zhaowei Chen, Linze Li, Ge Wu, Yuhao Chen, Shuheng Li, Zhenyu Zhao, Caihua Chen, Jiajun Liang, Yao Tang
Main category: cs.CV
TL;DR: LEDiT is a Length-Extrapolatable Diffusion Transformer that eliminates explicit positional encodings and uses causal attention with locality enhancement to enable high-resolution image generation (up to 4x scaling) without performance degradation.
Details
Motivation: Diffusion transformers struggle with resolution extrapolation due to positional encoding limitations: explicit PEs like RoPE degrade performance when inferring at resolutions different from training.
Method: Replaces explicit positional encodings with causal attention that implicitly encodes global positional information, plus a locality enhancement module to capture fine-grained local details.
Result: Supports up to 4x resolution scaling (e.g., 256x256 to 512x512) with better image quality than state-of-the-art methods on conditional and text-to-image generation tasks.
Conclusion: LEDiT offers a promising alternative to RoPE-based methods and demonstrates effective length extrapolation for diffusion transformers.
Abstract: Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that explicit positional encodings (PEs), such as RoPE, must extrapolate to unseen positions, which degrades performance when the inference resolution differs from training. In this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT) to overcome this limitation. LEDiT needs no explicit PEs, thereby avoiding PE extrapolation. The key innovation of LEDiT lies in the use of causal attention. We demonstrate that causal attention can implicitly encode global positional information and show that such information facilitates extrapolation. We further introduce a locality enhancement module, which captures fine-grained local information to complement the global coarse-grained position information encoded by causal attention. Experimental results on both conditional and text-to-image generation tasks demonstrate that LEDiT supports up to 4x resolution scaling (e.g., from 256x256 to 512x512), achieving better image quality compared to state-of-the-art length extrapolation methods. We believe that LEDiT marks a departure from standard RoPE-based methods and offers a promising insight into length extrapolation. Project page: https://shenzhang2145.github.io/ledit/
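The claim that causal attention implicitly encodes position is easy to see in code: with no positional encoding, a causal mask alone breaks permutation symmetry, since each token attends to a different-sized prefix. The sketch below illustrates that observation; sizes are arbitrary and the module is not LEDiT's actual block.

```python
import torch
import torch.nn as nn

class CausalTokenMixer(nn.Module):
    """Sketch of LEDiT's central observation: under a causal mask and no
    explicit positional encoding, a token's position becomes implicitly
    recoverable from how many tokens it can attend to."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, N, dim) image tokens with NO positional encoding
        N = tokens.size(1)
        causal = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=causal)
        return out      # positional information emerges from the mask
```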
[241] Adversarial Robustness of Discriminative Self-Supervised Learning in Vision
Ömer Veysel Çağatan, Ömer Faruk Tal, M. Emre Gürsoy
Main category: cs.CV
TL;DR: This paper evaluates the adversarial robustness of self-supervised learning (SSL) models compared to supervised models across various computer vision tasks.
Details
Motivation: Despite significant advances in SSL for visual representation learning, there is limited comprehensive evaluation of its adversarial robustness compared to supervised learning approaches.
Method: The study evaluates seven discriminative SSL models and one supervised model across ImageNet classification, transfer learning, segmentation, and detection tasks, analyzing factors like architecture, training duration, data augmentations, and batch sizes.
Result: SSL models generally show better adversarial robustness than supervised models on ImageNet, and this advantage extends to transfer learning with linear evaluation. However, the robustness gap narrows with fine-tuning and diminishes in segmentation/detection tasks.
Conclusion: SSL provides better adversarial robustness in certain scenarios but this advantage is context-dependent, influenced by evaluation methods and task types.
Abstract: Self-supervised learning (SSL) has advanced significantly in visual representation learning, yet comprehensive evaluations of its adversarial robustness remain limited. In this study, we evaluate the adversarial robustness of seven discriminative self-supervised models and one supervised model across diverse tasks, including ImageNet classification, transfer learning, segmentation, and detection. Our findings suggest that discriminative SSL models generally exhibit better robustness to adversarial attacks compared to their supervised counterpart on ImageNet, with this advantage extending to transfer learning when using linear evaluation. However, when fine-tuning is applied, the robustness gap between SSL and supervised models narrows considerably. Similarly, this robustness advantage diminishes in segmentation and detection tasks. We also investigate how various factors might influence adversarial robustness, including architectural choices, training duration, data augmentations, and batch sizes. Our analysis contributes to the ongoing exploration of adversarial robustness in visual self-supervised representation systems.
[242] Challenges and Trends in Egocentric Vision: A Survey
Xiang Li, Heqian Qiu, Lanxiao Wang, Hanwen Zhang, Chenghao Qi, Linfeng Han, Huiyu Xiong, Hongliang Li
Main category: cs.CV
TL;DR: This paper provides a comprehensive survey of egocentric vision understanding, categorizing tasks into subject, object, environment, and hybrid understanding, while summarizing challenges, datasets, and future directions.
Details
Motivation: With the rapid development of AI technologies and wearable devices, egocentric vision has emerged as a new research direction that captures visual data from a human perspective, attracting widespread attention from academia and industry.
Method: The paper systematically analyzes egocentric scene components and categorizes tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding, while exploring sub-tasks within each category.
Result: The survey summarizes current challenges, trends, and high-quality datasets in egocentric vision, providing valuable resources for future research in this field.
Conclusion: Egocentric vision technologies have broad applications in augmented reality, virtual reality, and embodied intelligence, with promising future research directions based on the latest developments in the field.
Abstract: With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.
[243] Cross-Domain Underwater Image Enhancement Guided by No-Reference Image Quality Assessment: A Transfer Learning Approach
Zhi Zhang, Minfu Li, Lu Li, Daoyi Chen
Main category: cs.CV
TL;DR: Trans-UIE is a transfer learning-based underwater image enhancement model that addresses domain discrepancy and dataset scarcity issues by using pretraining with Pearson correlation loss and fine-tuning guided by no-reference image quality assessment metrics.
Details
Motivation: Underwater image enhancement faces two major challenges: (1) pseudo ground truth labels in datasets cause domain discrepancy in supervised learning, and (2) scarce underwater datasets lead to overfitting and distribution shift.
Method: Proposes Trans-UIE with transfer learning: pretraining captures fundamental UIE paradigms using Pearson correlation loss to prevent overfitting, then fine-tuning uses both reference and non-reference datasets guided by NR-IQA metrics from above-water scenes to avoid confirmation bias.
Result: Experimental results on both full-reference and no-reference underwater benchmark datasets show that Trans-UIE significantly outperforms state-of-the-art methods.
Conclusion: The transfer learning approach with NR-IQA guidance effectively addresses domain discrepancy and dataset scarcity issues in underwater image enhancement, achieving superior performance compared to existing methods.
Abstract: Single underwater image enhancement (UIE) is a challenging ill-posed problem, and its development is hindered by two major issues: (1) The labels in underwater reference datasets are pseudo labels; relying on these pseudo ground truths in supervised learning leads to domain discrepancy. (2) Underwater reference datasets are scarce, making training on such small datasets prone to overfitting and distribution shift. To address these challenges, we propose Trans-UIE, a transfer learning-based UIE model that captures the fundamental paradigms of UIE through pretraining and utilizes a dataset composed of both reference and non-reference datasets for fine-tuning. However, fine-tuning the model using only reconstruction loss may introduce confirmation bias. To mitigate this, our method leverages no-reference image quality assessment (NR-IQA) metrics from above-water scenes to guide the transfer learning process across domains while generating enhanced images with the style of the above-water image domain. Additionally, to reduce the risk of overfitting during the pretraining stage, we introduce Pearson correlation loss. Experimental results on both full-reference and no-reference underwater benchmark datasets demonstrate that Trans-UIE significantly outperforms state-of-the-art methods.
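The Pearson correlation loss mentioned above has a standard construction; a minimal per-image sketch (the paper's exact formulation may differ) is:

```python
import torch

def pearson_correlation_loss(pred, target, eps=1e-8):
    """1 - Pearson r between prediction and target, averaged over the batch.

    pred, target: (B, C, H, W). Supervising correlation rather than raw
    pixel values is one way to relax the dependence on pseudo ground
    truths during pretraining.
    """
    p = pred.flatten(1)
    t = target.flatten(1)
    p = p - p.mean(dim=1, keepdim=True)
    t = t - t.mean(dim=1, keepdim=True)
    r = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return (1.0 - r).mean()
```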
[244] Multimodal Reference Visual Grounding
Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang
Main category: cs.CV
TL;DR: Introduces Multimodal Reference Visual Grounding (MRVG) - a new task that uses reference images to help detect similar objects in query images based on language expressions, addressing limitations of current LVLMs in differentiating visually similar objects.
Details
Motivation: Current Large Vision-Language Models struggle to differentiate between visually similar objects (e.g., Diet Coke vs regular Coke). Reference images can provide additional context to improve visual grounding accuracy for such challenging cases.
Method: Proposes MRVG-Net which combines few-shot object detection with Large Language Models for object matching. The approach efficiently uses reference images from a database to help detect target objects in query images based on language expressions.
Result: The method achieves superior visual grounding performance compared to state-of-the-art LVLMs like Qwen2.5-VL-72B, demonstrating effectiveness in handling similar object differentiation.
Conclusion: The work bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding with applications in robotics. A new dataset is introduced to study this problem.
Abstract: Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-72B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding, which has wide applications in robotics. Project page with our video, code, and dataset: https://irvlutd.github.io/MultiGrounding
[245] Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun
Main category: cs.CV
TL;DR: TRIG introduces a new benchmark and training dataset for improving visual text grounding capabilities of Multimodal Large Language Models (MLLMs) on text-rich document images, addressing limitations in current models’ ability to handle complex document layouts.
Details
Motivation: Current MLLMs struggle with visual text grounding in text-rich document images due to complex layouts, and existing benchmarks focus mainly on natural images rather than document images, creating a significant gap in evaluation and improvement.
Method: Proposed an OCR-LLM-human interaction pipeline to create 800 manually annotated QA pairs as benchmark and 90K synthetic training data from four diverse datasets. Also introduced two TRIG methods: general instruction tuning and plug-and-play efficient embedding.
Result: Comprehensive evaluation revealed substantial limitations in current MLLMs’ grounding capability on text-rich images. Finetuning MLLMs on the synthetic dataset showed promising improvements in spatial reasoning and grounding capabilities.
Conclusion: The TRIG benchmark and training dataset effectively address the gap in text-rich image grounding evaluation, and the proposed methods successfully enhance MLLMs’ performance on document image understanding tasks.
Abstract: Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-negligible limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90K synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By finetuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.
[246] Unsupervised Cross-Domain 3D Human Pose Estimation via Pseudo-Label-Guided Global Transforms
Jingjing Liu, Zhiyong Wang, Xinyu Fan, Amirhossein Dadashzadeh, Honghai Liu, Majid Mirmehdi
Main category: cs.CV
TL;DR: A novel framework for cross-domain 3D human pose estimation that addresses domain shifts by explicitly transforming pose positions between source and target camera coordinate systems using human-centered coordinates and iterative pseudo-label refinement.
Details
Motivation: Existing 3D human pose estimation methods suffer performance degradation in cross-scenario inference due to domain shifts, particularly from camera viewpoint and position variations that affect global pose positions.
Method: Proposes a three-module framework: 1) Pseudo-Label Generation Module for target 2D poses, 2) Global Transformation Module using human-centered coordinate system for cross-domain alignment, 3) Pose Augmentor for posture and body size variations, with iterative refinement.
Result: Outperforms state-of-the-art approaches on cross-dataset benchmarks (Human3.6M, MPI-INF-3DHP, 3DPW) and even surpasses target-trained models.
Conclusion: The proposed method effectively addresses domain adaptation challenges in 3D human pose estimation through explicit global transformations and iterative refinement, achieving superior cross-domain performance.
Abstract: Existing 3D human pose estimation methods often suffer in performance, when applied to cross-scenario inference, due to domain shifts in characteristics such as camera viewpoint, position, posture, and body size. Among these factors, camera viewpoints and locations have been shown to contribute significantly to the domain gap by influencing the global positions of human poses. To address this, we propose a novel framework that explicitly conducts global transformations between pose positions in the camera coordinate systems of source and target domains. We start with a Pseudo-Label Generation Module that is applied to the 2D poses of the target dataset to generate pseudo-3D poses. Then, a Global Transformation Module leverages a human-centered coordinate system as a novel bridging mechanism to seamlessly align the positional orientations of poses across disparate domains, ensuring consistent spatial referencing. To further enhance generalization, a Pose Augmentor is incorporated to address variations in human posture and body size. This process is iterative, allowing refined pseudo-labels to progressively improve guidance for domain adaptation. Our method is evaluated on various cross-dataset benchmarks, including Human3.6M, MPI-INF-3DHP, and 3DPW. The proposed method outperforms state-of-the-art approaches and even outperforms the target-trained model.
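The "human-centered coordinate system" used as a bridge can be illustrated with a minimal sketch; the choice of root joint and any rotation alignment are assumptions, not the paper's exact procedure:

```python
import numpy as np

def to_human_centered(pose_cam, root_idx=0):
    """Express a 3D pose relative to its root joint.

    pose_cam: (J, 3) joint positions in one camera coordinate system.
    Subtracting the root removes the camera-dependent global position,
    leaving a representation that can be re-anchored elsewhere.
    """
    root = pose_cam[root_idx:root_idx + 1]  # (1, 3)
    return pose_cam - root, root

def to_target_camera(pose_centered, target_root):
    """Re-anchor a human-centered pose at a position in the target domain."""
    return pose_centered + target_root
```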
[247] ChartQA-X: Generating Explanations for Visual Chart Reasoning
Shamanthak Hegde, Pooyan Fazli, Hasti Seifi
Main category: cs.CV
TL;DR: ChartQA-X is a comprehensive dataset for generating detailed explanations alongside chart question-answering, showing that model-generated explanations can surpass human-written ones in quality and improve model performance significantly.
Details
Motivation: The need to explain complex information from chart images effectively for data-driven decision-making, addressing the challenge of generating detailed explanations alongside answering chart questions.
Method: Created ChartQA-X dataset with 30,299 chart samples across four chart types, paired with questions, answers, and explanations generated and selected based on faithfulness, informativeness, coherence, and perplexity metrics.
Result: Human evaluation with 245 participants shows model-generated explanations surpass human-written ones in accuracy and logic, and are comparable in clarity and overall quality. Models fine-tuned on ChartQA-X show up to 24.57 points improvement in explanation quality and 18.96 percentage points in QA accuracy.
Conclusion: Integrating explanatory narratives with answers enables more effective conveyance of complex visual information, improving comprehension and trust in generated responses.
Abstract: The ability to explain complex information from chart images is vital for effective data-driven decision-making. In this work, we address the challenge of generating detailed explanations alongside answering questions about charts. We present ChartQA-X, a comprehensive dataset comprising 30,299 chart samples across four chart types, each paired with contextually relevant questions, answers, and explanations. Explanations are generated and selected based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our human evaluation with 245 participants shows that model-generated explanations in ChartQA-X surpass human-written explanations in accuracy and logic and are comparable in terms of clarity and overall quality. Moreover, models fine-tuned on ChartQA-X show substantial improvements across various metrics, including absolute gains of up to 24.57 points in explanation quality, 18.96 percentage points in question-answering accuracy, and 14.75 percentage points on unseen benchmarks for the same task. By integrating explanatory narratives with answers, our approach enables agents to convey complex visual information more effectively, improving comprehension and fostering greater trust in the generated responses.
[248] Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks
Giyeong Oh, Woohyun Cho, Siyeol Kim, Suhwan Choi, Younjae Yu
Main category: cs.CV
TL;DR: Orthogonal Residual Update decomposes module outputs into components parallel and orthogonal to input streams, adding only the orthogonal part to encourage learning novel features and improve generalization.
Details
Motivation: Standard residual connections may underutilize module capacity by predominantly reinforcing existing feature directions rather than learning entirely new representational directions.
Method: Decompose the module’s output relative to the input stream and add only the component orthogonal to this stream, guiding modules to contribute primarily new representational directions.
Result: Improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving a +4.3 percentage-point top-1 accuracy gain for ViT-B on ImageNet-1k.
Conclusion: Orthogonal residual updates foster richer feature learning and more efficient training by promoting contributions of novel representational directions rather than reinforcing existing ones.
Abstract: Residual connections are pivotal for deep neural networks, enabling greater depth by mitigating vanishing gradients. However, in standard residual updates, the module’s output is directly added to the input stream. This can lead to updates that predominantly reinforce or modulate the existing stream direction, potentially underutilizing the module’s capacity for learning entirely novel features. In this work, we introduce Orthogonal Residual Update: we decompose the module’s output relative to the input stream and add only the component orthogonal to this stream. This design aims to guide modules to contribute primarily new representational directions, fostering richer feature learning while promoting more efficient training. We demonstrate that our orthogonal update strategy improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving, for instance, a +4.3%p top-1 accuracy gain for ViT-B on ImageNet-1k.
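The update itself is a one-line projection; a minimal sketch consistent with the description above (the epsilon handling and exact placement in the block are assumptions):

```python
import torch

def orthogonal_residual_update(x, module_out, eps=1e-6):
    """Add only the component of the module output orthogonal to the stream.

    x, module_out: (..., dim). The parallel component, which would merely
    rescale the existing stream direction, is projected out before adding.
    """
    dot = (module_out * x).sum(dim=-1, keepdim=True)
    x_norm_sq = (x * x).sum(dim=-1, keepdim=True).clamp_min(eps)
    parallel = (dot / x_norm_sq) * x
    return x + (module_out - parallel)
```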
[249] LiDAR MOT-DETR: A LiDAR-based Two-Stage Transformer for 3D Multiple Object Tracking
Martha Teiko Teye, Ori Maoz, Matthias Rottmann
Main category: cs.CV
TL;DR: A LiDAR-based two-stage transformer approach for multi-object tracking that uses a smoother stage to refine detections and a tracker stage with DETR-based attention to maintain object identities across frames.
Details
Motivation: Traditional tracking systems struggle with maintaining consistent object identities in crowded or fast-moving scenes due to sparse LiDAR data and reliance on hand-crafted features. The paper addresses the need for better temporal coherence in LiDAR-based tracking.
Method: Two-stage DETR-inspired transformer: 1) Smoother stage refines LiDAR object detections across temporal windows, 2) Tracker stage uses DETR-based attention to associate tracked objects with refined detections using point cloud context. Trained on nuScenes and KITTI datasets in online and offline modes.
Result: Online mode outperforms LiDAR-only baseline and SOTA models on nuScenes with aMOTA of 0.724 and aMOTP of 0.475. Offline mode provides additional 3 percentage point improvement in aMOTP.
Conclusion: The proposed two-stage transformer approach effectively addresses LiDAR tracking challenges, demonstrating strong performance in both online and offline modes with significant improvements over existing methods.
Abstract: Multi-object tracking from LiDAR point clouds presents unique challenges due to the sparse and irregular nature of the data, compounded by the need for temporal coherence across frames. Traditional tracking systems often rely on hand-crafted features and motion models, which can struggle to maintain consistent object identities in crowded or fast-moving scenes. We present a LiDAR-based two-stage DETR-inspired transformer comprising a smoother and a tracker. The smoother stage refines LiDAR object detections, from any off-the-shelf detector, across a moving temporal window. The tracker stage uses a DETR-based attention block to maintain tracks across time by associating tracked objects with the refined detections using the point cloud as context. The model is trained on the nuScenes and KITTI datasets in both online and offline (forward-peeking) modes, demonstrating strong performance across metrics such as ID-switch and multiple object tracking accuracy (MOTA). The numerical results indicate that the online mode outperforms the LiDAR-only baseline and SOTA models on the nuScenes dataset, with an aMOTA of 0.724 and an aMOTP of 0.475, while the offline mode provides an additional 3 percentage points in aMOTP.
[250] Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation
Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi
Main category: cs.CV
TL;DR: Redemption Score (RS) is a novel hybrid framework for evaluating image captions by combining three complementary signals: mutual information divergence for distributional alignment, DINO-based perceptual similarity for visual grounding, and LLM text embeddings for contextual similarity.
Details
Motivation: Current image caption evaluation metrics often fail to comprehensively assess both visual semantics and language pragmatics, leading to incomplete evaluation of caption quality.
Method: The framework triangulates three signals: (1) Mutual Information Divergence for global image-text alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) LLM Text Embeddings for contextual text similarity against human references.
Result: On Flickr8k benchmark, RS achieves Kendall-τ of 58.42, outperforming most prior methods and showing superior correlation with human judgments without task-specific training. It also demonstrates consistent performance on Conceptual Captions and MS COCO.
Conclusion: RS provides a more robust and nuanced evaluation by thoroughly examining both visual accuracy and text quality together, offering a holistic assessment framework for image captions.
Abstract: Evaluating image captions requires cohesive assessment of both visual semantics and language pragmatics, which is often not entirely captured by most metrics. We introduce Redemption Score (RS), a novel hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) LLM Text Embeddings for contextual text similarity against human references. A calibrated fusion of these signals allows RS to offer a more holistic assessment. On the Flickr8k benchmark, RS achieves a Kendall-τ of 58.42, outperforming most prior methods and demonstrating superior correlation with human judgments without requiring task-specific training. Our framework provides a more robust and nuanced evaluation by thoroughly examining both the visual accuracy and text quality together, with consistent performance across Conceptual Captions and MS COCO.
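The abstract does not spell out the fusion; one plausible form of a "calibrated fusion" of three signals, purely illustrative (the weights and z-normalization are assumptions), is:

```python
import numpy as np

def fuse_scores(mid, dino_sim, text_sim, weights=(1.0, 1.0, 1.0)):
    """Fuse three per-caption signals into one ranking score.

    Each signal is z-normalized over the candidate set so the weighted
    sum is scale-free; the paper's actual calibration may be richer.
    """
    signals = [np.asarray(s, dtype=float) for s in (mid, dino_sim, text_sim)]
    zs = [(s - s.mean()) / (s.std() + 1e-8) for s in signals]
    return sum(w * z for w, z in zip(weights, zs))
```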
[251] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
Main category: cs.CV
TL;DR: SVG2 is a training-free framework that accelerates Diffusion Transformers (DiTs) for video generation by using semantic-aware permutation to cluster tokens based on similarity, achieving significant speedup while maintaining quality.
Details
Motivation: DiTs suffer from quadratic attention complexity causing latency. Existing sparse attention methods fail to achieve optimal quality due to inaccurate token identification (position-based clustering) and computation waste (scattered tokens).
Method: Proposes semantic-aware permutation using k-means to cluster tokens by semantic similarity, top-p dynamic budget control, and customized kernel implementations for efficient computation without padding.
Result: Achieves up to 2.30x and 1.89x speedup on HunyuanVideo and Wan 2.1 respectively, while maintaining PSNR of up to 30 and 26.
Conclusion: SVG2 provides a Pareto-optimal trade-off between generation quality and efficiency by maximizing identification accuracy and minimizing computation waste through semantic clustering.
Abstract: Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
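The core permutation step can be sketched with plain k-means; this is a simplification (the paper pairs it with top-p budget control and custom kernels, omitted here), and the function name is invented for illustration:

```python
import torch

def semantic_permutation(tokens, num_clusters, iters=10):
    """Cluster tokens by semantic similarity and return a reordering.

    tokens: (n, d) token embeddings. Argsort of the cluster assignments
    yields a permutation that packs each cluster into a contiguous block,
    so sparse attention can operate on dense tiles instead of scattered
    tokens.
    """
    n = tokens.shape[0]
    centroids = tokens[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)
        for c in range(num_clusters):
            members = tokens[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    perm = torch.argsort(assign, stable=True)
    return perm, assign
```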
[252] EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
Shengyuan Liu, Boyun Zheng, Wenting Chen, Zhihao Peng, Zhenfei Yin, Jing Shao, Jiancong Hu, Yixuan Yuan
Main category: cs.CV
TL;DR: EndoBench is a comprehensive benchmark for evaluating multi-modal large language models (MLLMs) in endoscopic procedures, covering 4 scenarios, 12 clinical tasks, and 6,832 VQA pairs across 21 datasets.
Details
Motivation: Current benchmarks for MLLMs in endoscopy are limited to specific scenarios and small task sets, failing to capture real-world diversity and full clinical workflow requirements.
Method: Created EndoBench with multi-dimensional evaluation framework spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations across 4 endoscopic scenarios and 12 clinical tasks with 5 levels of visual prompting granularities.
Result: Evaluation of 23 state-of-the-art models shows proprietary MLLMs outperform open-source and medical-specialized models but still trail human experts; medical-domain fine-tuning boosts accuracy; performance is sensitive to prompt format and task complexity.
Conclusion: EndoBench establishes a new standard for evaluating MLLMs in endoscopy, highlighting progress but persistent gaps between current models and expert clinical reasoning.
Abstract: Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow–spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations–to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.
[253] To Trust Or Not To Trust Your Vision-Language Model’s Prediction
Hao Dong, Moru Liu, Jian Liang, Eleni Chatzi, Olga Fink
Main category: cs.CV
TL;DR: TrustVLM is a training-free framework that improves misclassification detection in Vision-Language Models by leveraging the modality gap and distinct concept representations in image embedding space.
Details
Motivation: VLMs are susceptible to confident but incorrect predictions, posing risks in safety-critical domains where erroneous predictions can have severe consequences.
Method: Proposes a novel confidence-scoring function that leverages the modality gap and distinct concept representations in the image embedding space to detect misclassifications without requiring retraining.
Result: State-of-the-art performance across 17 datasets, 4 architectures, and 2 VLMs with improvements up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines.
Conclusion: TrustVLM enables safer deployment of VLMs in real-world applications by improving reliability without requiring retraining.
Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when VLM’s predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code is available at https://github.com/EPFL-IMOS/TrustVLM.
[254] Probabilistic Online Event Downsampling
Andreu Girbau-Xalabarder, Jun Nagata, Shinichi Sumiyoshi, Ricard Marsal, Shin’ichi Satoh
Main category: cs.CV
TL;DR: POLED is a probabilistic framework for event camera downsampling that models event importance through an event-importance probability density function (ePDF), enabling adaptive online sampling without task-specific training.
Details
Motivation: Event cameras generate high-bandwidth data that requires downsampling, but existing methods use fixed heuristics that lack adaptability. There’s a need for intelligent, scene-specific sampling that maintains performance under event budget constraints.
Method: Proposes POLED framework with event-importance probability density function (ePDF) that can be arbitrarily defined. Operates online to estimate importance from raw event streams. Introduces contour-preserving ePDF that prioritizes structurally important events and enables zero-shot downsampling.
Result: Evaluated across four datasets and tasks (object classification, image interpolation, surface normal estimation, object detection), demonstrating that intelligent sampling is crucial for maintaining performance under event-budget constraints.
Conclusion: POLED provides an effective probabilistic framework for adaptive event downsampling that outperforms fixed heuristic approaches while enabling zero-shot compatibility with models trained on original event streams.
Abstract: Event cameras capture scene changes asynchronously on a per-pixel basis, enabling extremely high temporal resolution. However, this advantage comes at the cost of high bandwidth, memory, and computational demands. To address this, prior work has explored event downsampling, but most approaches rely on fixed heuristics or threshold-based strategies, limiting their adaptability. Instead, we propose a probabilistic framework, POLED, that models event importance through an event-importance probability density function (ePDF), which can be arbitrarily defined and adapted to different applications. Our approach operates in a purely online setting, estimating event importance on-the-fly from raw event streams, enabling scene-specific adaptation. Additionally, we introduce zero-shot event downsampling, where downsampled events must remain usable for models trained on the original event stream, without task-specific adaptation. We design a contour-preserving ePDF that prioritizes structurally important events and evaluate our method across four datasets and tasks (object classification, image interpolation, surface normal estimation, and object detection), demonstrating that intelligent sampling is crucial for maintaining performance under event-budget constraints. Code is available.
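Importance-proportional downsampling under an event budget can be sketched as follows; the clipping and normalization are assumptions, not the paper's exact sampler:

```python
import numpy as np

def downsample_events(events, epdf, budget_ratio, rng=None):
    """Keep each event with probability proportional to its importance.

    events: array of events (e.g., structured fields x, y, t, polarity).
    epdf: callable returning a nonnegative importance weight per event.
    budget_ratio: target fraction of events to keep on average.
    """
    rng = rng or np.random.default_rng()
    w = np.asarray(epdf(events), dtype=float)
    p = np.clip(budget_ratio * w / (w.mean() + 1e-12), 0.0, 1.0)
    keep = rng.random(len(p)) < p
    return events[keep]
```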
[255] OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Main category: cs.CV
TL;DR: OmniSpatial is a comprehensive benchmark for evaluating spatial reasoning in vision-language models, covering dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking with 50 subcategories and 8.4K question-answer pairs.
Details
Motivation: Current spatial reasoning evaluations for VLMs focus on basic relations (left/right, near/far, counting) which are approaching saturation, while more complex spatial reasoning remains a bottleneck.
Method: Introduced OmniSpatial benchmark with careful manual annotation of 8.4K QA pairs across 4 major categories and 50 subcategories. Explored two strategies: PointGraph (explicit scene graph cues) and SpatialCoT (novel-view chain-of-thought).
Result: Extensive experiments show both open- and closed-source VLMs exhibit significant limitations in comprehensive spatial reasoning. The proposed strategies help bolster spatial reasoning capabilities.
Conclusion: OmniSpatial provides a challenging benchmark that reveals current VLMs’ limitations in complex spatial reasoning and offers promising directions for improvement through explicit structural cues and reasoning chains.
Abstract: Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs’ understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks cover only the most elementary layer of spatial reasoning and are largely approaching saturation in the latest reasoning models. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through careful manual annotation, we construct over 8.4K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs exhibit significant limitations in comprehensive spatial reasoning. We also explore two strategies, PointGraph (explicit scene graph cues) and SpatialCoT (novel-view chain-of-thought), to bolster spatial reasoning.
[256] A Quad-Step Approach to Uncertainty-Aware Deep Learning for Skin Cancer Classification
Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya
Main category: cs.CV
TL;DR: This paper presents a comprehensive evaluation of deep learning models for skin cancer classification using transfer learning and uncertainty quantification on the HAM10000 dataset, proposing a feature-fusion model with predictive entropy loss that achieves superior performance.
Details
Motivation: Accurate skin cancer diagnosis is crucial for early treatment and improved patient outcomes. While deep learning models show promise, challenges remain due to data scarcity and limited uncertainty awareness in automated classification systems.
Method: Benchmarked multiple pre-trained feature extractors (CLIP variants, ResNet50, DenseNet121, VGG16, EfficientNet-V2-Large) combined with traditional classifiers (SVM, XGBoost, logistic regression). Applied PCA settings and evaluated uncertainty quantification methods including Monte Carlo Dropout, Ensemble, and Ensemble Monte Carlo Dropout using uncertainty-aware metrics.
Result: LAION CLIP ViT-H/14 and ViT-L/14 at PCA-256 achieved the strongest baseline results. Ensemble methods with PCA-256 provided the best balance between accuracy and reliability. Feature fusion of top-performing extractors at PCA-256 with predictive entropy loss outperformed all prior configurations.
Conclusion: The proposed feature-fusion based model trained with predictive entropy loss function advances trustworthy DL-based skin cancer diagnosis by outperforming all previous configurations across both standard and uncertainty-aware evaluations.
Abstract: Accurate skin cancer diagnosis is vital for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, yet challenges remain due to data scarcity and limited uncertainty awareness. This study presents a comprehensive evaluation of DL-based skin lesion classification with transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. We benchmark several pre-trained feature extractors – including CLIP variants, ResNet50, DenseNet121, VGG16, and EfficientNet-V2-Large – combined with traditional classifiers such as SVM, XGBoost, and logistic regression. Multiple principal component analysis (PCA) settings (64, 128, 256, 512) are explored, with LAION CLIP ViT-H/14 and ViT-L/14 at PCA-256 achieving the strongest baseline results. In the UQ phase, Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) are applied and evaluated using uncertainty-aware metrics (UAcc, USen, USpe, UPre). Ensemble methods with PCA-256 provide the best balance between accuracy and reliability. Further improvements are obtained through feature fusion of top-performing extractors at PCA-256. Finally, we propose a feature-fusion based model trained with a predictive entropy (PE) loss function, which outperforms all prior configurations across both standard and uncertainty-aware evaluations, advancing trustworthy DL-based skin cancer diagnosis.
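Predictive entropy, which the paper uses both as an uncertainty measure and as a loss signal, has a standard form for Monte Carlo-style UQ; a minimal sketch:

```python
import numpy as np

def predictive_entropy(mc_probs, eps=1e-12):
    """Entropy of the mean predictive distribution.

    mc_probs: (T, N, C) softmax outputs from T stochastic passes
    (MC Dropout samples or ensemble members) over N samples, C classes.
    Higher values indicate less confident, more uncertain predictions.
    """
    mean_p = mc_probs.mean(axis=0)                       # (N, C)
    return -(mean_p * np.log(mean_p + eps)).sum(axis=1)  # (N,)
```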
[257] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance
Anju Chhetri, Jari Korhonen, Prashnna Gyawali, Binod Bhattarai
Main category: cs.CV
TL;DR: NERO is a novel OOD detection method that uses neuron-level relevance clustering and distance metrics to improve reliability in medical imaging by better identifying out-of-distribution samples.
Details
Motivation: Current OOD detection methods relying on feature or logit space representations may not fully capture OOD diversity, which is critical in medical imaging where identifying anomalies can prevent undetected diagnostic errors.
Method: Proposes NERO scoring mechanism that clusters neuron-level relevance for each in-distribution class to form centroids, uses relevance distance metrics to quantify sample deviation, incorporates scaled relevance in bias terms, and combines feature norms for enhanced separability.
Result: Validated across multiple deep learning architectures on gastrointestinal imaging benchmarks (Kvasir and GastroVision), achieving improvements over state-of-the-art OOD detection methods.
Conclusion: NERO provides an effective and explainable OOD detection framework that enhances reliability in medical imaging applications by better identifying out-of-distribution samples.
Abstract: Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model’s reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample’s deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
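The centroid-distance core of the score can be sketched as below; note the paper additionally incorporates scaled relevance in the bias term and feature norms, which this simplification omits:

```python
import numpy as np

def fit_relevance_centroids(relevance, labels, num_classes):
    """Mean neuron-level relevance vector per in-distribution class.

    relevance: (N, D) relevance scores at the feature layer for ID
    training samples; labels: (N,) class indices.
    """
    return np.stack([relevance[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def nero_score(sample_relevance, centroids):
    """OOD score = distance to the nearest class centroid (higher = more OOD)."""
    return np.linalg.norm(centroids - sample_relevance, axis=1).min()
```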
[258] SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren
Main category: cs.CV
TL;DR: SurgVidLM is a novel video language model designed for both full and fine-grained surgical video comprehension, addressing the gap in existing methods that overlook detailed task execution analysis in surgical procedures.
Details
Motivation: Current Multimodal Large Language Models (MLLMs) focus on image-based analysis or global video understanding but lack fine-grained video reasoning capabilities needed for analyzing specific surgical processes and detailed task execution, which is critical for surgical training and robotic decision-making.
Method: The authors propose SurgVidLM with a two-stage StageFocus mechanism: first stage extracts global procedural context, second stage performs high-frequency local analysis guided by temporal cues. They use Multi-frequency Fusion Attention to integrate low- and high-frequency visual tokens, and train on SVU-31K dataset containing over 31K video-instruction pairs.
Result: Experimental results show SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, demonstrating superior capability in capturing complex robot-assisted surgery contexts.
Conclusion: SurgVidLM successfully bridges the gap in fine-grained surgical video comprehension and represents the first video language model specifically designed for comprehensive surgical scene understanding, with promising applications for surgical training and robotic decision-making.
Abstract: Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, helping surgeons understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop the Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.
[259] Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding
Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Lin Mei, Peiyi Shen, Liang Zhang
Main category: cs.CV
TL;DR: CBM-HNMU is a framework that enhances interpretability and accuracy of black-box models by using Concept Bottleneck Models to identify and refine detrimental concepts, then distilling corrected knowledge back into the original model.
Details
Motivation: Deep learning models are becoming increasingly complex and less interpretable, with existing explanation methods lacking effective interventions or only operating at sample-level without modifying the model itself.
Method: Leverages Concept Bottleneck Model as an interpretable framework to approximate black-box reasoning, automatically identifies and refines detrimental concepts based on global gradient contributions, then distills corrected knowledge back into the black-box model.
Result: Evaluated on various CNN and transformer-based models across multiple datasets (Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, CUB-200), achieving maximum accuracy improvement of 2.64% and maximum increase in average accuracy of 1.03%.
Conclusion: CBM-HNMU successfully enhances both interpretability and accuracy of black-box models through concept-based intervention and knowledge distillation.
Abstract: Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at sample-level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.
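Gradient-times-activation is one common way to realize a "global gradient contribution" score for concepts; the sketch below is an assumption-laden illustration, not the paper's exact criterion:

```python
import torch

def rank_concepts_by_contribution(concept_acts, loss):
    """Rank concepts by their aggregate gradient contribution to the loss.

    concept_acts: (N, K) bottleneck activations over a probe set, part of
    the computation graph of the scalar task loss. Concepts with large
    positive contribution are candidates for removal or replacement.
    """
    grads = torch.autograd.grad(loss, concept_acts, retain_graph=True)[0]
    contribution = (grads * concept_acts).sum(dim=0)  # (K,)
    return torch.argsort(contribution, descending=True)
```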
[260] Diffusion models for multivariate subsurface generation and efficient probabilistic inversion
Roberto Miele, Niklas Linde
Main category: cs.CV
TL;DR: Diffusion models outperform VAEs and GANs for multivariate subsurface modeling, with improved corrections to Diffusion Posterior Sampling that enhance statistical robustness, posterior sampling, and computational efficiency.
Details
Motivation: To enhance multivariate modeling capabilities in subsurface geological scenarios and improve probabilistic inversion using diffusion models compared to existing methods like VAEs and GANs.
Method: Proposed corrections to Diffusion Posterior Sampling approach, including a likelihood approximation accounting for inherent noise-contamination in diffusion modeling. Applied to conditional modeling with both local hard data (well logs) and nonlinear geophysics (fullstack seismic data).
Result: Significantly improved statistical robustness, enhanced sampling of posterior probability density function, reduced computational costs compared to original approach. Works with both hard and indirect conditioning data individually or simultaneously.
Conclusion: Diffusion models provide faster and more efficient probabilistic inversion for subsurface modeling, outperforming traditional methods by integrating inversion within the diffusion process rather than requiring outer-loop approaches like MCMC.
Abstract: Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in the context of multivariate subsurface modeling and probabilistic inversion. We first demonstrate that diffusion models enhance multivariate modeling capabilities compared to variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps with update rules that can be modified to account for conditioning data. We propose different corrections to the popular Diffusion Posterior Sampling approach by Chung et al. (2023). In particular, we introduce a likelihood approximation accounting for the noise-contamination that is inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (fullstack seismic data). Our tests show significantly improved statistical robustness, enhanced sampling of the posterior probability density function and reduced computational costs, compared to the original approach. The method can be used with both hard and indirect conditioning data, individually or simultaneously. As the inversion is included within the diffusion process, it is faster than other methods requiring an outer-loop around the generative model, such as Markov chain Monte Carlo.
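The flavor of the proposed likelihood correction can be sketched on top of a standard DPS guidance step; everything here (the helper names, the additive form of the variance inflation) is an assumption for illustration, not the paper's formulation:

```python
import torch

def dps_guidance(x_t, t, predict_x0, forward_op, y, sigma_y, sigma_x0):
    """Gradient of a noise-aware Gaussian log-likelihood w.r.t. x_t.

    predict_x0: denoiser giving the x0-estimate at step t.
    sigma_x0: estimated residual noise in that x0-estimate; adding its
    variance to the data-noise variance is one way to account for the
    noise contamination inherent in diffusion modeling.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x_t, t)
    resid = y - forward_op(x0_hat)
    var = sigma_y ** 2 + sigma_x0 ** 2
    log_lik = -(resid ** 2).sum() / (2.0 * var)
    return torch.autograd.grad(log_lik, x_t)[0]
```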
[261] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence
Chenhui Qiang, Zhaoyang Wei, Xumeng Han, Zipeng Wang, Siyao Li, Xiangyuan Lan, Jianbin Jiao, Zhenjun Han
Main category: cs.CV
TL;DR: VER-Bench is a novel framework to evaluate MLLMs’ ability to identify fine-grained visual clues (occupying only 0.25% of image area) and integrate them with world knowledge for complex reasoning.
Details
Motivation: Current benchmarks either focus on basic perception (lacking deep reasoning) or mainstream reasoning (missing subtle clues). Profound visual understanding depends more on interpreting subtle local details than perceiving macro-level objects.
Method: VER-Bench comprises 374 carefully designed questions across 6 reasoning types (Geospatial, Temporal, Situational, Intent, System State, Symbolic), each with structured evidence including visual clues and question-related reasoning.
Result: The benchmark reveals current models’ limitations in extracting subtle visual evidence and constructing evidence-based arguments.
Conclusion: There is a need to enhance models’ capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis.
Abstract: With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., “what is in the image?”), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs’ ability to: 1) identify fine-grained visual clues, often occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models’ limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models’ capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.
[262] VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip
Wenqi Guo, Shan Du
Main category: cs.CV
TL;DR: VSF is a simple and efficient method for negative prompt guidance in few-step diffusion and flow-matching models by flipping attention values from negative prompts, outperforming existing methods with minimal computational overhead.
Details
Motivation: Existing negative prompt guidance methods like CFG, NASA, and NAG have limitations in few-step diffusion models. VSF aims to provide more effective negative prompt adherence with better efficiency.
Method: VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. It works with MMDiT-style architectures like Stable Diffusion 3.5 Turbo and cross-attention-based models like Wan.
Result: VSF significantly improves negative prompt adherence compared to prior methods in few-step models and even CFG in non-few-step models, while maintaining competitive image quality in both static image and video generation tasks.
Conclusion: VSF provides superior negative prompt guidance with minimal computational overhead, making it an effective solution for few-step diffusion and flow-matching models across various architectures.
Abstract: We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only a small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available at https://github.com/weathon/VSF/tree/main.
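A simplified reading of the sign-flip idea in a single cross-attention call; the actual integration points differ per architecture, and the neg_scale parameter is an assumption:

```python
import torch
import torch.nn.functional as F

def cross_attention_vsf(q, k_pos, v_pos, k_neg, v_neg, neg_scale=1.0):
    """Cross-attention with Value Sign Flip for the negative prompt.

    q: (B, Nq, d); k_pos/v_pos and k_neg/v_neg: keys and values from the
    positive and negative prompts. Negative values enter with flipped
    sign, so attention mass on negative-prompt tokens is subtracted from
    the output rather than added.
    """
    k = torch.cat([k_pos, k_neg], dim=-2)
    v = torch.cat([v_pos, -neg_scale * v_neg], dim=-2)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```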
[263] Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models
Tan-Hanh Pham, Chris Ngo
Main category: cs.CV
TL;DR: MCOUT proposes continuous latent reasoning instead of language-based Chain-of-Thought for multimodal models, achieving significant accuracy improvements on benchmarks.
Details
Motivation: Current reasoning methods like Chain-of-Thought are text-based and struggle with dynamic alignment of multimodal information (audio, visual, textual).
Method: MCOUT represents reasoning as continuous hidden vectors refined iteratively in joint latent space, with two variants: MCOUT-Base (reuses language model’s last hidden state) and MCOUT-Multi (integrates multimodal latent attention).
Result: Experiments on MMMU, ScienceQA, and MMStar show up to 8.23% accuracy gains over baselines and 8.27% BLEU score improvements across multiple-choice and open-ended tasks.
Conclusion: Latent continuous reasoning is a promising direction for advancing multimodal models beyond language-bound approaches, offering scalable human-like reflective inference.
Abstract: Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores by up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.
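A rough sketch of the MCOUT-Base loop as we read it: instead of decoding tokens, the model's last hidden state is fed back to the input as a continuous thought for a fixed number of refinement steps. The Hugging-Face-style `inputs_embeds`/`last_hidden_state` interface and the loop length are our assumptions:

```python
import torch

def mcout_base_step(model, inputs_embeds, num_thoughts=4):
    """Iterative latent reasoning in the spirit of MCOUT-Base: append the
    last hidden state back to the input as a 'continuous thought' instead
    of decoding words. `model` is assumed to return hidden states of
    shape (B, L, D)."""
    for _ in range(num_thoughts):
        hidden = model(inputs_embeds=inputs_embeds).last_hidden_state
        thought = hidden[:, -1:, :]                  # (B, 1, D) latent thought
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)
    return inputs_embeds
```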
[264] Enhancing Targeted Adversarial Attacks on Large Vision-Language Models via Intermediate Projector
Yiming Cao, Yanjie Li, Kaisheng Liang, Bin Xiao
Main category: cs.CV
TL;DR: A novel black-box targeted attack framework for Large Vision-Language Models that leverages the projector (Q-Former) to enhance attack effectiveness, granularity, and transferability.
Details
Motivation: Existing adversarial attack methods lack granularity for fine-grained attacks and neglect the projector's role in VLMs, posing safety concerns as adversaries can exploit model vulnerabilities to induce harmful outputs.
Method: Proposes Intermediate Projector Guided Attack (IPGA) for global attacks by aligning Q-Former query outputs with targets, and augments it with Residual Query Alignment (RQA) for fine-grained attacks to preserve unrelated content.
Result: IPGA significantly outperforms baselines in global targeted attacks, and IPGA-R achieves superior success rates and content preservation in fine-grained attacks, with effective transferability to commercial VLMs like Google Gemini and OpenAI GPT.
Conclusion: The proposed framework addresses limitations of existing methods by leveraging the projector for more effective, granular, and transferable black-box targeted attacks on VLMs.
Abstract: The growing deployment of Large Vision-Language Models (VLMs) raises safety concerns, as adversaries may exploit model vulnerabilities to induce harmful outputs, with targeted black-box adversarial attacks posing a particularly severe threat. However, existing methods primarily maximize encoder-level global similarity, which lacks the granularity for stealthy and practical fine-grained attacks, where only specific target should be altered (e.g., modifying a car while preserving its background). Moreover, they largely neglect the projector, a key semantic bridge in VLMs for multimodal alignment. To address these limitations, we propose a novel black-box targeted attack framework that leverages the projector. Specifically, we utilize the widely adopted Querying Transformer (Q-Former) which transforms global image embeddings into fine-grained query outputs, to enhance attack effectiveness and granularity. For standard global targeted attack scenarios, we propose the Intermediate Projector Guided Attack (IPGA), which aligns Q-Former fine-grained query outputs with the target to enhance attack strength and exploits the intermediate pretrained Q-Former that is not fine-tuned for any specific Large Language Model (LLM) to improve attack transferability. For fine-grained attack scenarios, we augment IPGA with the Residual Query Alignment (RQA) module, which preserves unrelated content by constraining non-target query outputs to enhance attack granularity. Extensive experiments demonstrate that IPGA significantly outperforms baselines in global targeted attacks, and IPGA with RQA (IPGA-R) attains superior success rates and unrelated content preservation over baselines in fine-grained attacks. Our method also transfers effectively to commercial VLMs such as Google Gemini and OpenAI GPT.
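The global targeted objective reduces to feature alignment at the projector. The PGD-style sketch below captures that idea under standard assumptions (an L-infinity budget, a cosine-similarity loss, pixel range [0, 1]); the paper's exact loss and the RQA term for fine-grained attacks are not modeled here:

```python
import torch

def ipga_attack(qformer, image, target_image, eps=8 / 255, steps=100, alpha=1 / 255):
    """Sketch of projector-level alignment: perturb `image` so the Q-Former's
    query outputs match those of `target_image` (global targeted setting).
    `qformer(x)` is assumed to map images to query outputs of shape (B, Q, D)."""
    delta = torch.zeros_like(image, requires_grad=True)
    with torch.no_grad():
        target_feat = qformer(target_image)
    for _ in range(steps):
        feat = qformer(image + delta)
        # Maximize similarity between adversarial and target query outputs.
        loss = torch.nn.functional.cosine_similarity(
            feat.flatten(1), target_feat.flatten(1)).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent on similarity
            delta.clamp_(-eps, eps)              # stay inside the L-inf budget
            delta.grad = None
    return (image + delta).clamp(0, 1).detach()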
[265] FastTracker: Real-Time and Accurate Visual Tracking
Hamidreza Hashempoor, Yu Dong Hwang
Main category: cs.CV
TL;DR: A generalized multi-object tracking framework that handles multiple object types with emphasis on vehicle tracking, featuring occlusion-aware re-identification and road-structure-aware tracklet refinement.
Details
Motivation: Conventional MOT systems are limited to pedestrian tracking and lack generalization to other object categories like vehicles in complex traffic scenes.
Method: Incorporates two key components: (1) occlusion-aware re-identification mechanism for identity preservation of occluded objects, and (2) road-structure-aware tracklet refinement using semantic scene priors (lane directions, crosswalks, road boundaries).
Result: Achieves robust performance on new vehicle-focused benchmark and public benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets.
Conclusion: The proposed framework demonstrates effectiveness in general-purpose object tracking while maintaining strong performance on conventional benchmarks.
Abstract: Conventional multi-object tracking (MOT) systems are predominantly designed for pedestrian tracking and often exhibit limited generalization to other object categories. This paper presents a generalized tracking framework capable of handling multiple object types, with a particular emphasis on vehicle tracking in complex traffic scenes. The proposed method incorporates two key components: (1) an occlusion-aware re-identification mechanism that enhances identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that utilizes semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. In addition, we introduce a new benchmark dataset comprising diverse vehicle classes with frame-level tracking annotations, specifically curated to support evaluation of vehicle-focused tracking methods. Extensive experimental results demonstrate that the proposed approach achieves robust performance on both the newly introduced dataset and several public benchmarks, highlighting its effectiveness in general-purpose object tracking. While our framework is designed for generalized multi-class tracking, it also achieves strong performance on conventional benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets. Code and Benchmark are available: github.com/Hamidreza-Hashempoor/FastTracker, huggingface.co/datasets/Hamidreza-Hashemp/FastTracker-Benchmark.
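The abstract does not spell out how lane priors enter the tracker, so the snippet below is only a toy illustration of the general idea: score a tracklet by how well its average motion agrees with a lane-direction prior, which a refinement stage could then use to reject or re-link implausible trajectories. The formulation is entirely ours:

```python
import numpy as np

def lane_consistency_score(tracklet_xy, lane_dir):
    """Toy road-structure prior: cosine agreement between a tracklet's
    average motion and the lane direction (a unit vector).
    tracklet_xy: (T, 2) image-plane positions; lane_dir: (2,)."""
    motion = np.diff(tracklet_xy, axis=0).mean(axis=0)
    norm = np.linalg.norm(motion)
    if norm < 1e-6:
        return 1.0                            # stationary objects: unconstrained
    return float(motion / norm @ lane_dir)    # in [-1, 1], 1 = with the lane
```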
[266] Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening
Piyush Bagad, Andrew Zisserman
Main category: cs.CV
TL;DR: The paper introduces chiral action recognition to measure time-sensitivity in video representations and proposes a self-supervised method to create compact, time-aware video embeddings that outperform larger models.
Details
Motivation: Current video embeddings poorly represent temporal changes in everyday actions like opening/closing doors. The goal is to develop compact representations sensitive to visual change over time.
Method: Self-supervised adaptation recipe using an auto-encoder with perceptual straightening-inspired latent space to inject time-sensitivity into frozen image features.
Result: Outperforms larger video models on Something-Something, EPIC-Kitchens, and Charade datasets, and improves classification when combined with existing models.
Conclusion: The proposed method successfully creates time-sensitive video representations that excel at distinguishing temporally opposite actions while being compact and effective.
Abstract: Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as “opening vs. closing a door”, “approaching vs. moving away from something”, “folding vs. unfolding paper”, etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count, ...), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time-aware video representations that offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder whose latent space has an inductive bias inspired by perceptual straightening. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charade. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.
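Perceptual straightening has a common operationalization: penalize the curvature of the latent trajectory so that consecutive displacement vectors stay aligned. The loss below is one such sketch and may differ from the paper's exact inductive bias:

```python
import torch
import torch.nn.functional as F

def straightening_loss(z):
    """Curvature penalty inspired by perceptual straightening: encourage
    consecutive displacement vectors of the latent trajectory z (B, T, D)
    to point in the same direction, i.e. a 'straight' path over time."""
    d = z[:, 1:] - z[:, :-1]                      # displacements (B, T-1, D)
    cos = F.cosine_similarity(d[:, 1:], d[:, :-1], dim=-1)
    return (1 - cos).mean()                       # 0 when perfectly straight
```

A straightened latent space makes simple monotone changes (opening vs. closing) linearly separable by direction, which is exactly what the chiral task probes.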
[267] AutoOEP – A Multi-modal Framework for Online Exam Proctoring
Aryan Kashyap Naveen, Bhuvanesh Singla, Raajan Wankhade, Shreesha M, Ramu S, Ram Mohana Reddy Guddeti
Main category: cs.CV
TL;DR: AutoOEP is an automated online exam proctoring system that uses dual cameras and multi-modal analysis (face verification, gaze tracking, object detection) with LSTM-based temporal analysis to detect cheating behaviors with 90.7% accuracy.
Details
Motivation: Online education growth creates a need for scalable academic integrity solutions. Traditional human proctoring is not scalable, while existing automated solutions are either intrusive or ineffective at detecting diverse cheating behaviors.
Method: Uses dual-camera setup (frontal and workspace views) with Face Module (ArcFace for identity verification, head pose, gaze tracking, mouth movement) and Hand Module (YOLOv11 for prohibited object detection). Features aggregated into LSTM network for temporal pattern analysis and real-time cheating probability scoring.
Result: Achieves 90.7% accuracy in suspicious activity classification, mAP@.5 of 0.57 for prohibited object detection, processes video at 2.4 fps without GPU. System is resource-efficient and reduces human intervention needs.
Conclusion: AutoOEP is an effective, scalable solution for automated online exam proctoring that enhances academic integrity while being computationally efficient. Code is publicly available.
Abstract: The burgeoning of online education has created an urgent need for robust and scalable systems to ensure academic integrity during remote examinations. Traditional human proctoring is often not feasible at scale, while existing automated solutions can be intrusive or fail to detect a wide range of cheating behaviors. This paper introduces AutoOEP (Automated Online Exam Proctoring), a comprehensive, multi-modal framework that leverages computer vision and machine learning to provide effective, automated proctoring. The system utilizes a dual-camera setup to capture both a frontal view of the examinee and a side view of the workspace, minimizing blind spots. Our approach integrates several parallel analyses: the Face Module performs continuous identity verification using ArcFace, along with head pose estimation, gaze tracking, and mouth movement analysis to detect suspicious cues. Concurrently, the Hand Module employs a fine-tuned YOLOv11 model for detecting prohibited items (e.g., mobile phones, notes) and tracks hand proximity to these objects. Features from these modules are aggregated and fed into a Long Short-Term Memory (LSTM) network that analyzes temporal patterns to calculate a real-time cheating probability score. We evaluate AutoOEP on a custom-collected dataset simulating diverse exam conditions. Our system achieves an accuracy of 90.7% in classifying suspicious activities. The object detection component obtains a mean Average Precision (mAP@.5) of 0.57 for prohibited items, and the entire framework processes video streams at approximately 2.4 frames per second without a GPU. The results demonstrate that AutoOEP is an effective and resource-efficient solution for automated proctoring, significantly reducing the need for human intervention and enhancing the integrity of online assessments. The code is public and can be accessed at https://github.com/05kashyap/AutoOEP.
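A minimal sketch of the temporal scoring head described above: per-frame features from the Face and Hand modules are concatenated and passed through an LSTM whose final state yields a cheating probability. Feature and hidden sizes here are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class CheatScorer(nn.Module):
    """Temporal head: per-frame features scored by an LSTM, with a sigmoid
    over the last hidden state giving a real-time cheating probability."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        out, _ = self.lstm(feats)
        return torch.sigmoid(self.head(out[:, -1]))  # (B, 1) probability
```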
[268] OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, Tong He
Main category: cs.CV
TL;DR: OmniWorld is a large-scale multi-domain dataset for 4D world modeling that addresses data limitations in current benchmarks, enabling better 4D reconstruction and video generation.
Details
Motivation: Current 4D world modeling is constrained by limited data quality, lacking dynamic complexity, multi-domain diversity, and proper annotations needed for tasks like 4D reconstruction and future prediction.
Method: Introduces OmniWorld dataset combining newly collected OmniWorld-Game data with curated public datasets, providing richer modality coverage and realistic dynamic interactions compared to existing synthetic datasets.
Result: Fine-tuning state-of-the-art methods on OmniWorld leads to significant performance gains in 4D reconstruction and video generation tasks, validating its effectiveness as a training resource.
Conclusion: OmniWorld serves as a catalyst for developing general-purpose 4D world models and advancing machines’ understanding of the physical world through improved data availability.
Abstract: The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines’ holistic understanding of the physical world.
[269] Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation
Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi
Main category: cs.CV
TL;DR: VocAlign is a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation using student-teacher paradigm with vocabulary alignment and LoRA fine-tuning.
Details
Motivation: To enable efficient domain adaptation for vision-language models in open-vocabulary semantic segmentation without requiring source data, addressing computational and memory constraints.
Method: Uses student-teacher paradigm with vocabulary alignment strategy, Low-Rank Adaptation (LoRA) for fine-tuning, and Top-K class selection mechanism to reduce memory requirements.
Result: Achieves a 6.11 mIoU improvement on the Cityscapes dataset and superior performance on zero-shot segmentation benchmarks.
Conclusion: Sets new standard for source-free adaptation in open-vocabulary setting with efficient and effective performance.
Abstract: We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the Cityscapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.
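The Top-K class selection can be pictured as two-stage scoring: rank the full vocabulary once with a cheap global descriptor, then compute dense per-pixel logits only for the K survivors. The sketch below is our reconstruction; the shapes and the use of a mean-pooled descriptor are assumptions:

```python
import torch

def topk_class_logits(pixel_feats, text_embeds, k=16):
    """Two-stage open-vocabulary scoring: rank all C classes at image level,
    keep the top k, and compute dense logits only for those, cutting memory
    for large vocabularies. pixel_feats: (B, D, H, W); text_embeds: (C, D)."""
    B, D, H, W = pixel_feats.shape
    img_feat = pixel_feats.mean(dim=(2, 3))          # (B, D) global descriptor
    scores = img_feat @ text_embeds.t()              # (B, C) coarse class scores
    topk = scores.topk(k, dim=1).indices             # (B, k) surviving classes
    flat = pixel_feats.flatten(2)                    # (B, D, H*W)
    logits = torch.einsum('bdn,bkd->bkn', flat, text_embeds[topk])
    return logits.reshape(B, k, H, W), topk
```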
[270] Minimal Semantic Sufficiency Meets Unsupervised Domain Generalization
Tan Pan, Kaiyu Guo, Dongli Xu, Zhaorui Tan, Chen Jiang, Deshu Chen, Xin Guo, Brian C. Lovell, Limei Han, Yuan Cheng, Mahsa Baktashmotlagh
Main category: cs.CV
TL;DR: The paper proposes MS-UDG, a method for Unsupervised Domain Generalization that learns Minimal Sufficient Semantic Representations by preserving semantic information while removing irrelevant variations, achieving state-of-the-art performance without category or domain labels.
Details
Motivation: Current unsupervised domain generalization methods often rely on domain labels which are unavailable in real-world scenarios. There's a need to enhance generalization of unsupervised learning models without such supervision.
Method: MS-UDG learns Minimal Sufficient Semantic Representations through: (1) InfoNCE-based objective for sufficiency, (2) semantic-variation disentanglement loss, and (3) reconstruction-based mechanism for minimality - all grounded in information theory.
Result: MS-UDG achieves state-of-the-art performance on unsupervised domain-generalization benchmarks, consistently outperforming existing SSL and UDG methods without requiring category or domain labels.
Conclusion: The proposed information-theoretic framework for learning minimal sufficient semantic representations effectively addresses UDG challenges and demonstrates superior generalization capabilities compared to existing approaches.
Abstract: The generalization ability of deep learning has been extensively studied in supervised settings, yet it remains less explored in unsupervised scenarios. Recently, the Unsupervised Domain Generalization (UDG) task has been proposed to enhance the generalization of models trained with prevalent unsupervised learning techniques, such as Self-Supervised Learning (SSL). UDG confronts the challenge of distinguishing semantics from variations without category labels. Although some recent methods have employed domain labels to tackle this issue, such domain labels are often unavailable in real-world contexts. In this paper, we address these limitations by formalizing UDG as the task of learning a Minimal Sufficient Semantic Representation: a representation that (i) preserves all semantic information shared across augmented views (sufficiency), and (ii) maximally removes information irrelevant to semantics (minimality). We theoretically ground these objectives from the perspective of information theory, demonstrating that optimizing representations to achieve sufficiency and minimality directly reduces out-of-distribution risk. Practically, we implement this optimization through Minimal-Sufficient UDG (MS-UDG), a learnable model integrating (a) an InfoNCE-based objective to achieve sufficiency; (b) two complementary components to promote minimality: a novel semantic-variation disentanglement loss and a reconstruction-based mechanism for capturing adequate variation. Empirically, MS-UDG sets a new state-of-the-art on popular unsupervised domain-generalization benchmarks, consistently outperforming existing SSL and UDG methods, without category or domain labels during representation learning.
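The sufficiency term is a standard InfoNCE objective over augmented views; for reference, a minimal implementation (the temperature value is arbitrary):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """InfoNCE between two augmented views (B, D): views of the same image
    are positives, all other images in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)  # diagonal positives
    return F.cross_entropy(logits, labels)
```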
[271] Efficient Rectified Flow for Image Fusion
Zirui Wang, Jiayi Zhang, Tianwei Guan, Yuhan Zhou, Xingyuan Li, Minjing Dong, Jinyuan Liu
Main category: cs.CV
TL;DR: RFfusion is an efficient one-step diffusion model for image fusion that uses Rectified Flow to straighten sampling paths and a task-specific VAE architecture to reduce computational complexity while maintaining high fusion quality.
Details
Motivation: Current diffusion models for image fusion require complex computations and redundant inference time, which limits their practical applicability. The authors aim to address these efficiency issues while maintaining high-quality fusion results.
Method: Proposes RFfusion with two key innovations: 1) Incorporates Rectified Flow to enable one-step sampling without additional training, and 2) Designs a task-specific VAE where fusion occurs in latent space, using a two-stage training strategy to align VAE objectives with fusion requirements.
Result: Extensive experiments show the method outperforms state-of-the-art methods in both inference speed and fusion quality, achieving efficient one-step sampling while maintaining high-quality results.
Conclusion: RFfusion successfully addresses the efficiency limitations of diffusion models for image fusion through Rectified Flow and task-specific VAE design, enabling practical deployment while maintaining superior fusion performance.
Abstract: Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities into a single fused image. In recent years, diffusion models have made significant developments in the field of image fusion. However, diffusion models often require complex computations and redundant inference time, which reduces the applicability of these methods. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality. Code is available at https://github.com/zirui0625/RFfusion.
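The efficiency argument rests on rectified flows having near-straight sampling paths, so a single Euler step can stand in for the full ODE solve. A minimal sketch, assuming a `velocity_net(z, t)` interface:

```python
import torch

@torch.no_grad()
def one_step_rectified_flow(velocity_net, z0):
    """One-step sampling under a (well) rectified flow: because the learned
    probability path is near-straight, a single Euler step from noise z0 at
    t=0 to t=1 approximates the full trajectory."""
    t = torch.zeros(z0.size(0), device=z0.device)  # start of the path
    v = velocity_net(z0, t)                        # predicted velocity field
    return z0 + v                                  # Euler step with dt = 1
```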
[272] Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning
Kai Jiang, Zhengyan Shi, Dell Zhang, Hongyuan Zhang, Xuelong Li
Main category: cs.CV
TL;DR: The paper proposes Mixture of Noise (Min), a method that learns beneficial noise to mitigate parameter drift in Class Incremental Learning using pre-trained models, achieving state-of-the-art performance.
Details
Motivation: Existing approaches that apply lightweight fine-tuning to pre-trained models in Class Incremental Learning induce parameter drift, which compromises the generalization capability of pre-trained models. Parameter drift acts as noise that obscures critical patterns learned for previous tasks.
Method: Min learns task-specific noise from high-dimension features of new tasks, dynamically adjusts weights for optimal mixture of different task noise, and embeds the beneficial noise into intermediate features to mask inefficient patterns. The approach is guided by information theory.
Result: Extensive experiments on six benchmark datasets demonstrate that Min achieves state-of-the-art performance in most incremental settings, with particularly outstanding results in 50-steps incremental settings.
Conclusion: The results show significant potential for beneficial noise in continual learning, successfully mitigating the degradation of backbone generalization when adapting to new tasks.
Abstract: Class Incremental Learning (CIL) aims to continuously learn new categories while retaining the knowledge of old ones. Pre-trained models (PTMs) show promising capabilities in CIL. However, existing approaches that apply lightweight fine-tuning to backbones still induce parameter drift, thereby compromising the generalization capability of pre-trained models. Parameter drift can be conceptualized as a form of noise that obscures critical patterns learned for previous tasks. However, recent research has shown that noise is not always harmful. For example, the large number of visual patterns learned from pre-training can be easily abused by a single task, and introducing appropriate noise can suppress some low-correlation features, thus leaving a margin for future tasks. To this end, we propose learning beneficial noise for CIL guided by information theory and propose Mixture of Noise (Min), aiming to mitigate the degradation of backbone generalization from adapting new tasks. Specifically, task-specific noise is learned from high-dimension features of new tasks. Then, a set of weights is adjusted dynamically for an optimal mixture of different task noise. Finally, Min embeds the beneficial noise into the intermediate features to mask the response of inefficient patterns. Extensive experiments on six benchmark datasets demonstrate that Min achieves state-of-the-art performance in most incremental settings, with particularly outstanding results in 50-steps incremental settings. This shows the significant potential for beneficial noise in continual learning. Code is available at https://github.com/ASCIIJK/MiN-NeurIPS2025.
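Our loose reading of the mechanism, as a sketch: keep one learnable noise vector per task and add their dynamically weighted mixture to intermediate features. The module below is illustrative only; in the paper the noise is learned from high-dimensional features of new tasks and the mixing rule may be richer:

```python
import torch
import torch.nn as nn

class MixtureOfNoise(nn.Module):
    """Illustrative mixture-of-noise injection: a softmax-weighted sum of
    per-task noise vectors is added to intermediate features, masking
    patterns the current task should not rely on."""
    def __init__(self, num_tasks, feat_dim):
        super().__init__()
        self.noise = nn.Parameter(torch.randn(num_tasks, feat_dim) * 0.01)
        self.logits = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, feats):                    # feats: (B, feat_dim)
        w = torch.softmax(self.logits, dim=0)    # dynamic mixture weights
        return feats + (w[:, None] * self.noise).sum(dim=0)
```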
[273] PolypSeg-GradCAM: Towards Explainable Computer-Aided Gastrointestinal Disease Detection Using U-Net Based Segmentation and Grad-CAM Visualization on the Kvasir Dataset
Akwasi Asare, Ulas Bagci
Main category: cs.CV
TL;DR: PolypSeg-GradCAM is an explainable deep learning framework that combines U-Net with Grad-CAM for transparent polyp segmentation in colonoscopy images, achieving high accuracy while providing interpretable visualizations.
Details
Motivation: Colorectal cancer is a major global health issue, with polyps as critical precursors. Manual polyp segmentation is labor-intensive and subjective, while existing deep learning methods lack interpretability needed for clinical adoption.
Method: The framework integrates U-Net architecture with Gradient-weighted Class Activation Mapping (Grad-CAM) for explainable polyp segmentation. Trained and evaluated on the Kvasir-SEG dataset containing 1000 annotated endoscopic images.
Result: Achieved mean Intersection over Union (IoU) of 0.9257 on test set and consistently high Dice coefficients (F-score > 0.96) on training/validation sets. Grad-CAM visualizations confirmed predictions were based on clinically relevant regions.
Conclusion: PolypSeg-GradCAM represents progress toward reliable, trustworthy AI-assisted colonoscopy by combining high segmentation accuracy with interpretability, potentially improving early colorectal cancer prevention.
Abstract: Colorectal cancer (CRC) remains one of the leading causes of cancer-related morbidity and mortality worldwide, with gastrointestinal (GI) polyps serving as critical precursors according to the World Health Organization (WHO). Early and accurate segmentation of polyps during colonoscopy is essential for reducing CRC progression, yet manual delineation is labor-intensive and prone to observer variability. Deep learning methods have demonstrated strong potential for automated polyp analysis, but their limited interpretability remains a barrier to clinical adoption. In this study, we present PolypSeg-GradCAM, an explainable deep learning framework that integrates the U-Net architecture with Gradient-weighted Class Activation Mapping (Grad-CAM) for transparent polyp segmentation. The model was trained and evaluated on the Kvasir-SEG dataset of 1000 annotated endoscopic images. Experimental results demonstrate robust segmentation performance, achieving a mean Intersection over Union (IoU) of 0.9257 on the test set and consistently high Dice coefficients (F-score > 0.96) on training and validation sets. Grad-CAM visualizations further confirmed that predictions were guided by clinically relevant regions, enhancing transparency and trust in the model’s decisions. By coupling high segmentation accuracy with interpretability, PolypSeg-GradCAM represents a step toward reliable, trustworthy AI-assisted colonoscopy and improved early colorectal cancer prevention.
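Grad-CAM itself is standard, so the interpretability half of the pipeline fits in a few lines: channel-wise gradient averages weight the activations of a chosen convolutional layer, followed by a ReLU and normalization. The hook wiring that captures activations and gradients is omitted here:

```python
import torch

def grad_cam(activations, gradients):
    """Minimal Grad-CAM: weight each channel of a convolutional feature map
    by the spatial mean of its gradient w.r.t. the score, sum, then ReLU.
    activations / gradients: (B, C, H, W), captured via forward/backward hooks."""
    weights = gradients.mean(dim=(2, 3), keepdim=True)    # (B, C, 1, 1)
    cam = torch.relu((weights * activations).sum(dim=1))  # (B, H, W)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # normalize to [0, 1]
    return cam
```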
[274] PerceptronCARE: A Deep Learning-Based Intelligent Teleophthalmology Application for Diabetic Retinopathy Diagnosis
Akwasi Asare, Isaac Baffour Senkyire, Emmanuel Freeman, Mary Sagoe, Simon Hilary Ayinedenaba Aluze-Ele, Kelvin Kwao
Main category: cs.CV
TL;DR: PerceptronCARE is a deep learning-based teleophthalmology system for automated diabetic retinopathy detection using retinal images, achieving 85.4% accuracy with optimized neural networks for real-time screening.
Details
Motivation: Diabetic retinopathy is a leading cause of vision loss globally, especially in underserved regions, creating a need for accessible and efficient screening solutions.
Method: Developed using multiple convolutional neural networks (ResNet-18, EfficientNet-B0, SqueezeNet) to balance accuracy and computational efficiency, with cloud-based scalability and secure data management.
Result: The system achieved 85.4% accuracy in disease severity classification, enabling real-time screening in clinical and telemedicine settings.
Conclusion: AI-driven telemedicine solutions like PerceptronCARE can expand access to diabetic retinopathy screening, particularly in remote and resource-constrained environments, improving early diagnosis and reducing healthcare costs.
Abstract: Diabetic retinopathy is a leading cause of vision loss among adults and a major global health challenge, particularly in underserved regions. This study presents PerceptronCARE, a deep learning-based teleophthalmology application designed for automated diabetic retinopathy detection using retinal images. The system was developed and evaluated using multiple convolutional neural networks, including ResNet-18, EfficientNet-B0, and SqueezeNet, to determine the optimal balance between accuracy and computational efficiency. The final model classifies disease severity with an accuracy of 85.4%, enabling real-time screening in clinical and telemedicine settings. PerceptronCARE integrates cloud-based scalability, secure patient data management, and a multi-user framework, facilitating early diagnosis, improving doctor-patient interactions, and reducing healthcare costs. This study highlights the potential of AI-driven telemedicine solutions in expanding access to diabetic retinopathy screening, particularly in remote and resource-constrained environments.
[275] Rethinking Pulmonary Embolism Segmentation: A Study of Current Approaches and Challenges with an Open Weight Model
Yixin Zhang, Ryan Chamberlain, Lawrence Ngo, Kevin Kramer, Maciej A. Mazurowski
Main category: cs.CV
TL;DR: Systematic evaluation of 9 segmentation architectures for pulmonary embolism (PE) segmentation from CTPA scans, revealing that 3D U-Net with ResNet encoder performs best, CNNs outperform ViTs, and pretraining can hurt performance.
Details
Motivation: To conduct a comprehensive performance audit of various segmentation architectures for PE segmentation and understand the factors affecting performance in this clinically important task.
Method: Used a densely annotated in-house dataset of 490 CTPA scans to evaluate 9 segmentation architectures (CNN and ViT families) with pretrained vs. random weights under a unified testing framework.
Result: Best model achieved mean Dice score of 0.7131, detecting 181 emboli with 49 false positives and 28 false negatives. 3D models, particularly 3D U-Net with ResNet encoder, performed best. CNNs outperformed ViTs, and pretraining sometimes hurt performance.
Conclusion: 3D CNN architectures remain most effective for PE segmentation, with consistent performance patterns across models. Distal emboli remain challenging, and PE classification/segmentation may rely on different features.
Abstract: In this study, we curated a densely annotated in-house dataset comprising 490 CTPA scans. Using this dataset, we systematically evaluated nine widely used segmentation architectures from both the CNN and Vision Transformer (ViT) families, initialized with either pretrained or random weights, under a unified testing framework as a performance audit. Our study leads to several important observations: (1) 3D U-Net with a ResNet encoder remains a highly effective architecture for PE segmentation; (2) 3D models are particularly well-suited to this task given the morphological characteristics of emboli; (3) CNN-based models generally yield superior performance compared to their ViT-based counterparts in PE segmentation; (4) classification-based pretraining, even on large PE datasets, can adversely impact segmentation performance compared to training from scratch, suggesting that PE classification and segmentation may rely on different sets of discriminative features; (5) different model architectures show a highly consistent pattern of segmentation performance when trained on the same data; and (6) while central and large emboli can be segmented with satisfactory accuracy, distal emboli remain challenging due to both task complexity and the scarcity of high-quality datasets. Besides these findings, our best-performing model achieves a mean Dice score of 0.7131 for segmentation. It detects 181 emboli with 49 false positives and 28 false negatives from 60 in-house testing scans. Its generalizability is further validated on public datasets.
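For reference, the headline metric is the Dice coefficient (the best model reaches a mean of 0.7131); a standard batched implementation for binary masks:

```python
import torch

def dice_score(pred, target, eps=1e-6):
    """Dice coefficient for binary segmentation masks.
    pred / target: {0, 1} tensors of shape (B, ...)."""
    pred, target = pred.float().flatten(1), target.float().flatten(1)
    inter = (pred * target).sum(dim=1)
    return ((2 * inter + eps) /
            (pred.sum(dim=1) + target.sum(dim=1) + eps)).mean()
```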
[276] Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation
Yuanhuiyi Lyu, Chi Kit Wong, Chenfei Liao, Lutao Jiang, Xu Zheng, Zexin Lu, Linfeng Zhang, Xuming Hu
Main category: cs.CV
TL;DR: UiG is a novel reasoning framework that integrates understanding capabilities into the generation process for text-to-image models, using image editing as a bridge to enhance generation quality.
Details
Motivation: Current Chain-of-Thought methods separate understanding and generation processes, limiting their ability to guide unified models in addressing generative deficiencies.
Method: Proposes Understanding-in-Generation (UiG) framework that uses image editing as a bridge to infuse understanding into generation. It verifies generated images, incorporates model understanding into editing instructions, and enhances images step-by-step.
Result: Significant performance improvement in text-to-image generation, achieving 3.92% gain on long prompt setting of TIIF benchmark compared to existing methods.
Conclusion: UiG effectively integrates understanding capabilities into the generation process, demonstrating superior performance over traditional reasoning methods for unified text-to-image models.
Abstract: Recent works have made notable advancements in enhancing unified models for text-to-image generation through the Chain-of-Thought (CoT). However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of our UiG is to integrate generative guidance by the strong understanding capabilities during the reasoning process, thereby mitigating the limitations of generative abilities. To achieve this, we introduce “Image Editing” as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. The project code: https://github.com/QC-LY/UiG
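The verify-then-edit loop can be summarized in pseudocode. Every method name below (`generate`, `understand`, `edit`, `to_edit_instruction`) is a hypothetical stand-in for the unified model's interfaces, not the project's actual API:

```python
def uig_generate(unified_model, prompt, max_rounds=3):
    """Sketch of the UiG loop: generate, verify the image with the model's
    own understanding branch, turn the critique into an editing instruction,
    and refine step by step."""
    image = unified_model.generate(prompt)
    for _ in range(max_rounds):
        critique = unified_model.understand(image, prompt)  # verify the image
        if critique.satisfied:
            break
        instruction = critique.to_edit_instruction()
        image = unified_model.edit(image, instruction)      # infuse understanding
    return image
```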
[277] COLT: Enhancing Video Large Language Models with Continual Tool Usage
Yuyang Liu, Xinyuan Shi, Xiaodan Liang
Main category: cs.CV
TL;DR: COLT enhances open-source video LLMs with continuous tool usage capability, enabling automatic acquisition of tool-use abilities in evolving tool streams without catastrophic forgetting.
Details
Motivation: Existing video LLM methods assume fixed tool repositories and struggle with real-world environments where tool data is perpetually evolving and streaming in.
Method: COLT incorporates a learnable tool codebook as tool-specific memory, dynamically selecting relevant tools based on similarity between user instruction and tool features. Uses VideoToolBench dataset for instruction tuning.
Result: Extensive experiments on video LLM benchmarks and VideoToolBench demonstrate state-of-the-art performance.
Conclusion: COLT successfully enables continuous tool usage in video LLMs, overcoming limitations of existing methods in dynamic tool environments.
Abstract: The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering ‘catastrophic forgetting’ of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
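The codebook lookup at the heart of COLT is a similarity search; a minimal sketch, with the embedding dimensions and the choice of cosine similarity as our assumptions:

```python
import torch
import torch.nn.functional as F

def select_tools(instruction_emb, tool_codebook, k=3):
    """Rank entries of a learnable tool codebook by cosine similarity to the
    user-instruction embedding and return the top-k tool ids.
    instruction_emb: (D,); tool_codebook: (num_tools, D)."""
    sims = F.cosine_similarity(instruction_emb[None, :], tool_codebook, dim=1)
    return sims.topk(k).indices          # ids of the k most relevant tools
```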
[278] Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang
Main category: cs.CV
TL;DR: Citrus-V is a multimodal medical foundation model that integrates detection, segmentation, and chain-of-thought reasoning for comprehensive medical image analysis and diagnostic inference in a single framework.
Details
Motivation: Existing medical imaging models are narrowly focused and require multiple specialized networks, limiting generalization. Clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning that current models lack.
Method: Proposes a novel multimodal training approach combining image analysis with textual reasoning. Integrates detection, segmentation, and multimodal chain-of-thought reasoning in a single framework. Releases curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks.
Result: Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks. Enables pixel-level lesion localization, structured report generation, and physician-like diagnostic inference.
Conclusion: The model delivers a unified pipeline from visual grounding to clinical reasoning, supporting precise lesion quantification, automated reporting, and reliable second opinions for clinical applications.
Abstract: Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
[279] Investigating Traffic Accident Detection Using Multimodal Large Language Models
Ilhan Skender, Kailin Tong, Selim Solmaz, Daniel Watzenig
Main category: cs.CV
TL;DR: This paper investigates zero-shot capabilities of multimodal large language models (MLLMs) for traffic accident detection using infrastructure camera images, achieving promising results without fine-tuning by integrating advanced visual analytics.
Details
Motivation: Traffic safety requires timely accident detection, and infrastructure-based vision sensors offer scalable solutions. The research aims to minimize reliance on extensive labeled datasets by leveraging MLLMs' zero-shot capabilities.
Method: Evaluated MLLMs (Gemini 1.5/2.0, Gemma 3, Pixtral) on simulated DeepAccident dataset from CARLA, using enhanced prompts with YOLO for object detection, Deep SORT for tracking, and SAM for segmentation to improve accuracy.
Result: Pixtral achieved best performance with 71% F1-score and 83% recall. Gemini models gained precision (up to 90%) but suffered F1/recall losses. Gemma 3 showed most balanced performance with minimal metric fluctuation.
Conclusion: Integrating MLLMs with advanced visual analytics demonstrates substantial potential for real-world automated traffic monitoring systems, enhancing applicability without extensive training data.
Abstract: Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis of Gemini 1.5, Gemini 2.0, Gemma 3, and Pixtral in accident identification and description without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi-object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 71% and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
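The prompt-enhancement step can be pictured as: run the vision tools first, then serialize their outputs into the question. The sketch below uses hypothetical stand-in interfaces for the YOLO, Deep SORT, and SAM components; none of these calls reflect a real library API:

```python
def build_enhanced_prompt(image, yolo, tracker, sam):
    """Sketch of prompt enhancement: detections, tracks, and masks are
    summarized as text context before asking the MLLM about the scene.
    All interfaces (yolo, tracker, sam, .label/.box/.id fields) are
    hypothetical stand-ins."""
    detections = yolo(image)                 # objects with .label and .box
    tracks = tracker.update(detections)      # adds a persistent .id per object
    masks = sam(image, detections)           # instance masks for each box
    objects = "; ".join(f"{t.label} (id {t.id}) at {t.box}" for t in tracks)
    return (f"Detected objects: {objects}. Segmented instances: {len(masks)}.\n"
            "Based on this context and the image, is there a traffic accident? "
            "If so, describe it.")
```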
[280] Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
Main category: cs.CV
TL;DR: Lavida-O is a unified Masked Diffusion Model that supports multimodal understanding and generation tasks including image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis.
Details
Motivation: Existing multimodal MDMs like MMaDa and Muddit only support simple image-level understanding and low-resolution generation, lacking comprehensive multimodal capabilities.
Method: Uses Elastic Mixture-of-Transformers architecture with lightweight generation branch and larger understanding branch, incorporating token compression, universal text conditioning, stratified sampling, and planning with iterative self-reflection.
Result: Achieves state-of-the-art performance on RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing benchmarks, outperforming Qwen2.5-VL and FluxKontext-dev with inference speedup.
Conclusion: Lavida-O establishes a new paradigm for scalable multimodal reasoning and generation through its unified framework.
Abstract: We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.
cs.AI
[281] The Indispensable Role of User Simulation in the Pursuit of AGI
Krisztian Balog, ChengXiang Zhai
Main category: cs.AI
TL;DR: User simulation is critical for overcoming AGI development bottlenecks by providing scalable evaluation environments and generating interaction data for training adaptive agents.
Details
Motivation: AGI development faces bottlenecks in evaluating complex interactive systems and acquiring sufficient interaction data for training. User simulation can address these challenges.
Method: Proposes using computational agents that mimic human interaction with AI systems to create realistic simulators for evaluation and data generation.
Result: Identifies that user simulation technology and intelligent task agents are synergistic and must advance together to accelerate AGI development.
Conclusion: Research into user simulation is essential for AGI progress, requiring interdisciplinary collaboration to build realistic simulators and address challenges like those posed by large language models.
Abstract: Progress toward Artificial General Intelligence (AGI) faces significant bottlenecks, particularly in rigorously evaluating complex interactive systems and acquiring the vast interaction data needed for training adaptive agents. This paper posits that user simulation – creating computational agents that mimic human interaction with AI systems – is not merely a useful tool, but is a critical catalyst required to overcome these bottlenecks and accelerate AGI development. We argue that realistic simulators provide the necessary environments for scalable evaluation, data generation for interactive learning, and fostering the adaptive capabilities central to AGI. Therefore, research into user simulation technology and intelligent task agents are deeply synergistic and must advance hand-in-hand. This article elaborates on the critical role of user simulation for AGI, explores the interdisciplinary nature of building realistic simulators, identifies key challenges including those posed by large language models, and proposes a future research agenda.
[282] Evaluation-Aware Reinforcement Learning
Shripad Vilasrao Deshmukh, Will Schwarzer, Scott Niekum
Main category: cs.AI
TL;DR: Evaluation-Aware Reinforcement Learning (EvA-RL) trains policies to maximize return while minimizing evaluation error, enabling accurate policy assessment with limited data by co-learning value predictors.
Details
Motivation: Standard RL suffers from high variance (limited data, long horizons) and high bias (unequal support, inaccurate models) in policy evaluation. Current approaches treat evaluation as an afterthought rather than a core training objective.
Method: EvA-RL framework trains policies to be “easy to evaluate” by minimizing expected evaluation error under a given value prediction scheme. Extends to co-learn assessment-conditioned state-value predictors alongside policies to mitigate performance-evaluation tradeoffs.
Result: Empirical results across discrete and continuous domains show EvA-RL substantially reduces evaluation error while maintaining competitive returns, enabling accurate evaluation with small rollout budgets.
Conclusion: EvA-RL establishes a new class of RL methods that treat reliable evaluation as a first-class principle, laying foundation for safety-critical systems requiring trustworthy performance assessment.
Abstract: Policy evaluation is often a prerequisite for deploying safety- and performance-critical systems. Existing evaluation approaches frequently suffer from high variance due to limited data and long-horizon tasks, or high bias due to unequal support or inaccurate environmental models. We posit that these challenges arise, in part, from the standard reinforcement learning (RL) paradigm of policy learning without explicit consideration of evaluation. As an alternative, we propose evaluation-aware reinforcement learning (EvA-RL), in which a policy is trained to maximize expected return while simultaneously minimizing expected evaluation error under a given value prediction scheme – in other words, being “easy” to evaluate. We formalize a framework for EvA-RL and design an instantiation that enables accurate policy evaluation, conditioned on a small number of rollouts in an assessment environment that can be different than the deployment environment. However, our theoretical analysis and empirical results show that there is often a tradeoff between evaluation accuracy and policy performance when using a fixed value-prediction scheme within EvA-RL. To mitigate this tradeoff, we extend our approach to co-learn an assessment-conditioned state-value predictor alongside the policy. Empirical results across diverse discrete and continuous action domains demonstrate that EvA-RL can substantially reduce evaluation error while maintaining competitive returns. This work lays the foundation for a broad new class of RL methods that treat reliable evaluation as a first-class principle during training.
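Read literally, the training objective combines a return term with a penalized evaluation error. The trade-off weight $\lambda$, the squared-error form, and the notation $\hat{V}_{\text{assess}}$ for the value predicted from assessment-environment rollouts are our shorthand, not the paper's:

$$\max_{\pi}\;\; \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t}\gamma^{t} r_{t}\Big] \;-\; \lambda\,\mathbb{E}\Big[\big(\hat{V}_{\text{assess}}(\pi)-V^{\pi}\big)^{2}\Big]$$

The paper's co-learned, assessment-conditioned value predictor corresponds to also optimizing $\hat{V}_{\text{assess}}$ rather than holding the prediction scheme fixed.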
[283] Estimating the Self-Consistency of LLMs
Robert Nowak
Main category: cs.AI
TL;DR: Analysis of self-consistency estimator for LLMs under fixed compute budget, showing optimal split between prompts and repetitions is proportional to square root of budget.
Details
Motivation: Systems often repeat prompts to LLMs and aggregate responses to improve reliability, but this requires understanding the tradeoffs between sampling more prompts vs. more repetitions under fixed compute constraints.
Method: Mathematical analysis of an estimator for LLM self-consistency under fixed compute budget B=mn, where m is number of prompts and n is number of repetitions per prompt.
Result: The analysis reveals that the optimal tradeoff favors a rough split where both m and n are proportional to the square root of the total budget B.
Conclusion: For fixed compute budgets, the optimal strategy is to balance the number of prompts sampled and repetitions per prompt, with both quantities scaling as the square root of the total compute available.
Abstract: Systems often repeat the same prompt to large language models (LLMs) and aggregate responses to improve reliability. This short note analyzes an estimator of the self-consistency of LLMs and the tradeoffs it induces under a fixed compute budget $B=mn$, where $m$ is the number of prompts sampled from the task distribution and $n$ is the number of repeated LLM calls per prompt; the resulting analysis favors a rough split $m,n\propto\sqrt{B}$.
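The $m,n\propto\sqrt{B}$ result is easy to apply in practice. The helper below is our own illustration (the rounding policy is arbitrary): with $B=10{,}000$ total calls, it allocates 100 prompts with 100 repetitions each.

```python
import math

def allocate_budget(budget):
    """Even split m = n = sqrt(B) suggested by the analysis."""
    m = max(1, round(math.sqrt(budget)))   # number of distinct prompts
    n = max(1, budget // m)                # repeated LLM calls per prompt
    return m, n

print(allocate_budget(10_000))  # -> (100, 100)
```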
[284] Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning
Sai Teja Reddy Adapala
Main category: cs.AI
TL;DR: The paper introduces a formal theory of computational cognitive load in LLMs, identifying Context Saturation and Attentional Residue as key mechanisms that degrade reasoning performance. It presents the ICE benchmark to systematically test these factors and reveals significant performance variations across models, with smaller architectures showing complete brittleness while larger models demonstrate partial resilience.
Details
Motivation: There's a critical gap between LLMs' performance on static benchmarks and their fragility in dynamic, information-rich environments. The computational limits governing reasoning under cognitive load remain poorly understood, particularly how extraneous information and task-switching interfere with performance.
Method: Developed the Interleaved Cognitive Evaluation (ICE) benchmark to systematically manipulate cognitive load factors (Context Saturation and Attentional Residue) on multi-hop reasoning tasks. Conducted a comprehensive study with N=10 replications per item across 200 questions, testing five instruction-tuned models.
Result: Smaller models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) showed complete brittleness with 0% accuracy across all conditions. Gemini-2.0-Flash-001 showed partial resilience (85% accuracy in controls) with statistically significant degradation under context saturation (β=-0.003 per % load, p<0.001).
Conclusion: Cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. Dynamic, cognitive-aware stress testing like ICE is essential for evaluating true resilience and safety of advanced AI systems.
Abstract: The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational limits that govern their reasoning under cognitive load remain poorly understood. In this work, we introduce a formal theory of computational cognitive load, positing that extraneous, task-irrelevant information (Context Saturation) and interference from task-switching (Attentional Residue) are key mechanisms that degrade performance. We designed the Interleaved Cognitive Evaluation (ICE), a deconfounded benchmark to systematically manipulate these load factors on challenging multi-hop reasoning tasks. A comprehensive study (N = 10 replications per item across 200 questions) revealed significant performance variations across five instruction-tuned models. Smaller open-source architectures (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2) exhibited baseline brittleness, achieving 0% accuracy (SEM = 0.0) across all conditions, including clean controls, on this high-intrinsic-load task. In contrast, Gemini-2.0-Flash-001 showed partial resilience, achieving 85% accuracy in control conditions, with a statistically significant degradation under context saturation ($\beta = -0.003$ per % load, $p < 0.001$). These findings provide preliminary evidence that cognitive load is a key contributor to reasoning failures, supporting theories of hallucination-as-guessing under uncertainty. We conclude that dynamic, cognitive-aware stress testing, as exemplified by the ICE benchmark, is essential for evaluating the true resilience and safety of advanced AI systems.
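As an illustration of how Context Saturation might be induced, the sketch below pads a multi-hop item with task-irrelevant sentences until distractors make up a target share of the context; the function and its parameters are hypothetical stand-ins, not the ICE generator itself:

```python
import random

def saturate_context(question, facts, distractors, load_pct):
    """Build an ICE-style item: mix the facts needed for a multi-hop question
    with task-irrelevant sentences so distractors form ~load_pct% of the context."""
    if load_pct >= 100:
        n_distract = len(distractors)
    else:
        n_distract = int(len(facts) * load_pct / (100 - load_pct))
    context = facts + random.sample(distractors, min(n_distract, len(distractors)))
    random.shuffle(context)  # interleaving topics also induces attentional residue
    return " ".join(context) + "\n\nQuestion: " + question
```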
[285] Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation
Ramy ElMallah, Krish Chhajer, Chi-Guhn Lee
Main category: cs.AI
TL;DR: StepEval is a proposed evaluation framework that uses vision-language models to automatically assess subgoal success in robot manipulation tasks, moving beyond binary success rates to provide detailed per-subgoal performance vectors.
Details
Motivation: Current robot learning papers only report binary success rates, which obscures where policies succeed or fail along multi-step tasks. There's a need for subgoal-level reporting to make partial competence visible.
Method: StepEval uses vision-language models as automated judges of subgoal outcomes from recorded images/videos. It’s designed as a cost-aware, plug-in framework with per-subgoal success rate vectors as primary evaluation artifacts.
Result: The paper proposes design principles for a scalable, community-driven open-source project that can remain model-agnostic, support various input types, and be lightweight enough for widespread adoption.
Conclusion: The framework aims to establish standard, reproducible practices for scoring task steps rather than just final goals, inviting open-source contributions to make subgoal-level evaluation routine in robot learning.
Abstract: Robot learning papers typically report a single binary success rate (SR), which obscures where a policy succeeds or fails along a multi-step manipulation task. We argue that subgoal-level reporting should become routine: for each trajectory, a vector of per-subgoal SRs that makes partial competence visible (e.g., grasp vs. pour). We propose a blueprint for StepEval, a cost-aware plug-in evaluation framework that utilizes vision-language models (VLMs) as automated judges of subgoal outcomes from recorded images or videos. Rather than proposing new benchmarks or APIs, our contribution is to outline design principles for a scalable, community-driven open-source project. In StepEval, the primary artifact for policy evaluation is the per-subgoal SR vector; however, other quantities (e.g., latency or cost estimates) are also considered for framework-optimization diagnostics to help the community tune evaluation efficiency and accuracy when ground-truth subgoal success labels are available. We discuss how such a framework can remain model-agnostic, support single- or multi-view inputs, and be lightweight enough to adopt across labs. The intended contribution is a shared direction: a minimal, extensible seed that invites open-source contributions, so that scoring the steps, not just the final goal, becomes a standard and reproducible practice.
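The primary evaluation artifact is easy to picture in code. A minimal sketch of aggregating per-subgoal judge verdicts into the success-rate vector the paper advocates, assuming the VLM judge emits booleans keyed by subgoal name:

```python
from collections import defaultdict

def subgoal_success_rates(episodes):
    """Aggregate per-subgoal judgments into a success-rate vector.
    `episodes` is a list of {subgoal_name: bool} dicts, one per trajectory,
    as might be emitted by a VLM judge scoring recorded rollouts."""
    counts, hits = defaultdict(int), defaultdict(int)
    for judgments in episodes:
        for subgoal, ok in judgments.items():
            counts[subgoal] += 1
            hits[subgoal] += int(ok)
    return {sg: hits[sg] / counts[sg] for sg in counts}

# e.g. [{"grasp": True, "pour": False}, {"grasp": True, "pour": True}]
# -> {"grasp": 1.0, "pour": 0.5}: partial competence is visible per step.
```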
[286] Nano Bio-Agents (NBA): Small Language Model Agents for Genomics
George Hong, Daniel Trejo Banos
Main category: cs.AI
TL;DR: Small Language Models (SLMs) under 10B parameters combined with an agentic framework achieve high accuracy (85-98%) in genomics QA while reducing computational costs and addressing hallucination issues.
Details
Motivation: To address hallucination problems and high computational costs associated with using large language models for genomics question answering, while enabling more efficient and accessible ML-powered genomics tools.
Method: Developed Nano Bio-Agent (NBA) framework that incorporates task decomposition, tool orchestration, and API access to systems like NCBI and AlphaGenome, applied to small language models (3-10B parameters).
Result: SLMs with the agentic framework achieved up to 98% accuracy on the GeneTuring benchmark, with 3-10B parameter models consistently achieving 85-97% accuracy while requiring significantly lower computational resources than conventional approaches.
Conclusion: Small language models combined with agentic frameworks offer promising potential for efficiency gains, cost savings, and democratization of ML-powered genomics tools while maintaining robust and accurate performance comparable to larger models.
Abstract: We investigate the application of Small Language Models (<10 billion parameters) for genomics question answering via an agentic framework to address hallucination issues and computational cost challenges. The Nano Bio-Agent (NBA) framework we implemented incorporates task decomposition, tool orchestration, and API access into well-established systems such as NCBI and AlphaGenome. Results show that SLMs combined with such an agentic framework can achieve comparable and in many cases superior performance versus existing approaches utilising larger models, with our best model-agent combination achieving 98% accuracy on the GeneTuring benchmark. Notably, small 3-10B parameter models consistently achieve 85-97% accuracy while requiring much lower computational resources than conventional approaches. This demonstrates promising potential for efficiency gains, cost savings, and democratization of ML-powered genomics tools while retaining highly robust and accurate performance.
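A minimal sketch in the spirit of the NBA decompose-then-dispatch loop; `slm` is a small instruction-tuned model wrapped as a callable, and tool names such as "ncbi_gene_lookup" are hypothetical stand-ins for real API wrappers:

```python
def run_genomics_agent(slm, question, tools):
    """Decompose a genomics question into tool calls, execute them, then
    answer from the grounded observations. Interfaces are illustrative."""
    plan = slm(f"Decompose into tool calls, one 'tool_name: argument' per line.\n{question}")
    observations = []
    for step in plan.splitlines():
        name, _, arg = step.partition(":")
        if name.strip() in tools:  # ground each step in a real database call
            observations.append(tools[name.strip()](arg.strip()))
    return slm(f"Question: {question}\nTool results: {observations}\nAnswer:")
```

Grounding each step in a database such as NCBI is what lets a small model avoid hallucinating facts it cannot store in its weights.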
[287] What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities
Nathanael Jo, Ashia Wilson
Main category: cs.AI
TL;DR: The paper proposes a principled framework for AI evaluation as inference, addressing reliability issues in benchmark assessments by treating evaluations as inferences based on theories of capability rather than simple measurements.
Details
Motivation: Growing skepticism about the reliability of generative model evaluations on benchmark data, as current methods treat benchmark scores as direct measurements rather than inferences based on theoretical assumptions about capability.
Method: Develops a framework that begins from a theory of capability and derives methods for estimating it, including addressing sensitivity to perturbations through uncertainty accounting and an adaptive algorithm that reduces sample complexity.
Result: The framework provides more reliable and trustworthy estimates of AI capabilities by explicitly modeling evaluation as inference and accounting for uncertainty factors.
Conclusion: This approach lays the groundwork for more principled AI evaluation methods that can better reflect true model capabilities and address current reliability concerns in benchmark assessments.
Abstract: Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI’s capabilities. Yet growing skepticism surrounds their reliability. How can we know that a reported accuracy genuinely reflects a model’s true performance? Evaluations are often presented as simple measurements, but in reality they are inferences: to treat benchmark scores as evidence of capability is already to assume a theory of what capability is and how it manifests in a test. We make this step explicit by proposing a principled framework for evaluation as inference: begin from a theory of capability, and then derive methods for estimating it. This perspective, familiar in fields such as psychometrics, has not yet become commonplace in AI evaluation. As a proof of concept, we address a central challenge that undermines reliability: sensitivity to perturbations. After formulating a model of ability, we introduce methods that infer ability while accounting for uncertainty from sensitivity and finite samples, including an adaptive algorithm that significantly reduces sample complexity. Together, these contributions lay the groundwork for more reliable and trustworthy estimates of AI capabilities as measured through benchmarks.
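A minimal sketch of the underlying idea: score each item under several random surface perturbations and report a mean with a standard error rather than a bare accuracy. This illustrates the uncertainty accounting only, not the paper's model of ability or its adaptive algorithm:

```python
import statistics

def ability_estimate(model, items, perturb, k=5):
    """Score each item under k random surface perturbations, so the reported
    number reflects ability rather than one lucky phrasing.
    `model` returns 0/1 correctness; `perturb` rewrites an item's surface form."""
    per_item = [sum(model(perturb(item)) for _ in range(k)) / k for item in items]
    mean = statistics.mean(per_item)
    se = statistics.stdev(per_item) / len(per_item) ** 0.5 if len(per_item) > 1 else 0.0
    return mean, se  # an estimate with uncertainty, not a bare accuracy
```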
[288] SteinerSQL: Graph-Guided Mathematical Reasoning for Text-to-SQL Generation
Xutao Mao, Tao Liu, Hongying Zan
Main category: cs.AI
TL;DR: SteinerSQL is a unified framework that addresses complex Text-to-SQL queries by combining mathematical reasoning and schema navigation into a single graph optimization problem, achieving state-of-the-art results on challenging benchmarks.
Details
Motivation: LLMs struggle with complex Text-to-SQL queries requiring both mathematical reasoning and schema navigation. Existing methods handle these challenges separately, leading to fractured reasoning processes that compromise logical and structural correctness.
Method: SteinerSQL operates in three stages: mathematical decomposition to identify required tables (terminals), optimal reasoning scaffold construction via a Steiner tree problem, and multi-level validation to ensure correctness.
Result: On LogicCat and Spider2.0-Lite benchmarks, SteinerSQL achieves 36.10% and 40.04% execution accuracy respectively using Gemini-2.5-Pro, establishing new state-of-the-art performance.
Conclusion: SteinerSQL presents a new unified paradigm for Text-to-SQL that paves the way for more robust and principled solutions to complex reasoning tasks.
Abstract: Large Language Models (LLMs) struggle with complex Text-to-SQL queries that demand both sophisticated mathematical reasoning and intricate schema navigation. Existing methods often tackle these challenges in isolation, creating a fractured reasoning process that compromises logical and structural correctness. To resolve this, we introduce SteinerSQL, a framework that unifies these dual challenges into a single, graph-centric optimization problem. SteinerSQL operates in three stages: mathematical decomposition to identify required tables (terminals), optimal reasoning scaffold construction via a Steiner tree problem, and multi-level validation to ensure correctness. On the challenging LogicCat and Spider2.0-Lite benchmarks, SteinerSQL establishes a new state-of-the-art with 36.10% and 40.04% execution accuracy, respectively, using Gemini-2.5-Pro. Beyond accuracy, SteinerSQL presents a new, unified paradigm for Text-to-SQL, paving the way for more robust and principled solutions to complex reasoning tasks.
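The scaffold-construction stage maps cleanly onto standard graph tooling. A minimal sketch using networkx's approximate Steiner-tree routine on a toy schema graph; the table names and weights are illustrative, and the paper's graph construction and validation stages are its own contribution:

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# Schema graph: nodes are tables, weighted edges are candidate join paths.
G = nx.Graph()
G.add_weighted_edges_from([
    ("orders", "customers", 1), ("orders", "order_items", 1),
    ("order_items", "products", 1), ("customers", "regions", 2),
    ("products", "suppliers", 2),
])

# Terminals: tables the mathematical decomposition identified as required.
terminals = ["customers", "products", "regions"]

# The approximate Steiner tree gives a minimal join scaffold connecting them.
scaffold = steiner_tree(G, terminals, weight="weight")
print(sorted(scaffold.edges()))  # edges to traverse when assembling the SQL joins
```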
[289] GRAFT: GRaPH and Table Reasoning for Textual Alignment – A Benchmark for Structured Instruction Following and Visual Reasoning
Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran
Main category: cs.AI
TL;DR: GRAFT is a structured multimodal benchmark for evaluating models on visual reasoning tasks using programmatically generated charts and tables with systematically created analytical questions and structured answer formats.
Details
Motivation: To provide a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, addressing the need for comprehensive evaluation standards in this field.
Method: Programmatically generates charts and synthetically rendered tables using Python visualization libraries, pairs them with multi-step analytical questions based solely on visual content, and provides answers in structured formats (JSON/YAML) with a taxonomy of reasoning types.
Result: Creates a benchmark that enables consistent evaluation of instruction-following, visual reasoning, and visual-textual alignment tasks through precise, aspect-based assessment following strict factual and formatting guidelines.
Conclusion: GRAFT sets a new evaluation standard by offering a comprehensive framework for assessing multimodal models on structured reasoning tasks with controlled data semantics and systematic question-answer generation.
Abstract: GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field.
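The generation recipe is straightforward to sketch: render a chart from controlled data with matplotlib, then derive a question and a JSON-formatted reference answer from the same data. The item below is illustrative, not an actual GRAFT instance:

```python
import json
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Programmatic data generation keeps full control over semantics.
sales = {"Q1": 120, "Q2": 95, "Q3": 180, "Q4": 140}
fig, ax = plt.subplots()
ax.bar(list(sales), list(sales.values()))
ax.set_title("Quarterly Sales")
fig.savefig("chart.png")

# Question and reference answer derived from the same data, answer in JSON.
instance = {
    "image": "chart.png",
    "question": "Which quarter had the highest sales, and by how much did it exceed the lowest?",
    "answer": json.dumps({"highest": "Q3", "margin": 180 - 95}),
    "reasoning_type": ["ranking", "comparison"],
}
print(instance)
```

Because the answer is computed from the source data rather than annotated by hand, evaluation can be exact and aspect-based.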
[290] Calibrated Reasoning: An Explanatory Verifier for Dynamic and Efficient Problem-Solving
Anisha Garg, Engin Tekin, Yash More, David Bick, Nishit Neema, Ganesh Venkatesh
Main category: cs.AI
TL;DR: A pairwise Explanatory Verifier trained with reinforcement learning (GRPO) produces calibrated confidence scores and natural language reasoning to improve test-time computing strategies like best-of-n and self-reflection.
Details
Motivation: Advanced test-time computing strategies are limited by models' poor self-evaluation capabilities, which caps their effectiveness in scaling reasoning models.
Method: Proposes a pairwise Explanatory Verifier trained via reinforcement learning (GRPO) that generates calibrated confidence scores and associated natural language reasoning for generated solutions.
Result: The verifier improves accuracy and efficiency of test-time strategies and excels at identifying challenging failure modes where standard methods like majority voting fail, particularly when both candidate solutions are identically incorrect.
Conclusion: The pairwise Explanatory Verifier with GRPO training effectively addresses self-evaluation limitations in reasoning models, enabling more robust test-time computing strategies.
Abstract: Advanced test-time computing strategies are essential for scaling reasoning models, but their effectiveness is capped by the models’ poor self-evaluation. We propose a pairwise Explanatory Verifier, trained via reinforcement learning (GRPO), that produces calibrated confidence scores and associated natural language reasoning for generated solutions. Our verifier improves the accuracy and efficiency of test-time strategies like best-of-n and self-reflection. Crucially, it excels at identifying challenging failure modes, such as when both candidate solutions are identically incorrect, succeeding where standard methods like majority voting fail.
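A minimal sketch of how a pairwise verifier could drive best-of-n selection, assuming `verifier(a, b)` returns a calibrated probability that `a` is correct and `b` is not (the actual verifier also emits natural-language reasoning, omitted here):

```python
import itertools

def best_of_n(verifier, candidates):
    """Select among n candidate solutions with a pairwise verifier by
    accumulating each candidate's pairwise wins."""
    scores = [0.0] * len(candidates)
    for i, j in itertools.combinations(range(len(candidates)), 2):
        p = verifier(candidates[i], candidates[j])
        scores[i] += p
        scores[j] += 1.0 - p
    best = max(range(len(candidates)), key=scores.__getitem__)
    # Uniformly low scores can flag the failure mode where every candidate
    # is identically incorrect and majority voting would be blind.
    return candidates[best], scores
```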
[291] UserRL: Training Interactive User-Centric Agent via Reinforcement Learning
Cheng Qian, Zuxin Liu, Akshara Prabhakar, Jielin Qiu, Zhiwei Liu, Haolin Chen, Shirley Kokane, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Main category: cs.AI
TL;DR: UserRL is a unified framework for training and evaluating user-centric AI agents through standardized gym environments with simulated users, focusing on reward shaping and trajectory scoring to improve multi-turn interactions.
Details
Motivation: Current RL-trained agentic models face challenges in assisting real users due to the diversity and dynamics of user interactions, requiring better training and evaluation methods for user-centric abilities.
Method: Proposes UserRL framework with standardized gym environments and simulated users, systematically varying turn-level reward assignment and trajectory-level score calculation using the GRPO algorithm, tested on Qwen3 models.
Result: Three key findings: (1) SFT cold start is critical for initial interaction ability and sustained RL improvements; (2) deliberate trajectory scoring enables more efficient multi-turn interactions; (3) open-source simulators like Qwen3-32B are cost-effective alternatives to stronger simulators like GPT-4o.
Conclusion: Careful design of reward shaping and user simulation choice is as crucial as model scale, establishing UserRL as a practical pathway for developing robust user-centric agentic models.
Abstract: Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet, the ultimate value of such agents lies in their ability to assist users, a setting where diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user simulation choice is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All code and data are public for future research.
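To illustrate the trajectory-level scoring axis the experiments vary, here are three common formulations over turn-level rewards; the function name and the particular variants are illustrative, not UserRL's API:

```python
def trajectory_score(turn_rewards, mode="discounted", gamma=0.9):
    """Three trajectory-level scores over a list of turn-level rewards."""
    if mode == "sum":          # every turn counts equally
        return sum(turn_rewards)
    if mode == "discounted":   # early helpful turns are worth more
        return sum(r * gamma ** t for t, r in enumerate(turn_rewards))
    if mode == "final":        # only eventual task completion matters
        return turn_rewards[-1]
    raise ValueError(mode)
```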
[292] Multi-Modal Artificial Intelligence of Embryo Grading and Pregnancy Prediction in Assisted Reproductive Technology: A Review
Xueqiang Ouyang, Jia Wei
Main category: cs.AI
TL;DR: This paper reviews AI applications in embryo grading and pregnancy prediction for assisted reproductive technology, focusing on multi-modal data integration from static images, time-lapse videos, and structured data.
Details
Motivation: To address challenges in conventional IVF-ET procedures, including subjective embryo grading and inefficient multi-modal data integration, by leveraging AI technologies to improve pregnancy success rates.
Method: Comprehensive review of AI applications organized by data modalities (static images, time-lapse videos, structured tabular data), analyzing multi-modal feature fusion, data scarcity issues, model generalization, and regulatory frameworks.
Result: The review identifies core challenges in current research and provides insights into how different data modalities contribute to embryo assessment and pregnancy prediction in ART.
Conclusion: The paper outlines future research directions for advancing multi-modal AI applications in ART, offering actionable guidance to overcome existing limitations and improve clinical outcomes.
Abstract: Infertility, a pressing global health concern, affects a substantial proportion of individuals worldwide. While advancements in assisted reproductive technology (ART) have offered effective interventions, conventional in vitro fertilization-embryo transfer (IVF-ET) procedures still encounter significant hurdles in enhancing pregnancy success rates. Key challenges include the inherent subjectivity in embryo grading and the inefficiency of multi-modal data integration. Against this backdrop, the adoption of AI-driven technologies has emerged as a pivotal strategy to address these issues. This article presents a comprehensive review of the progress in AI applications for embryo grading and pregnancy prediction from a novel perspective, with a specific focus on the utilization of different modal data, such as static images, time-lapse videos, and structured tabular data. The reason for this perspective is that reorganizing tasks based on data sources can not only more accurately depict the essence of the problem but also help clarify the rationality and limitations of model design. Furthermore, this review critically examines the core challenges in contemporary research, encompassing the intricacies of multi-modal feature fusion, constraints imposed by data scarcity, limitations in model generalization capabilities, and the dynamically evolving legal and regulatory frameworks. On this basis, it explicitly identifies potential avenues for future research, aiming to provide actionable guidance for advancing the application of multi-modal AI in the field of ART.
[293] The Conductor and the Engine: A Path Towards Co-Designed Reasoning
Yuanxin Wang, Pawel Filipczuk, Anisha Garg, Amaan Dhada, Mohammad Hassanpour, David Bick, Ganesh Venkatesh
Main category: cs.AI
TL;DR: The paper introduces CEPO, an optimized reasoning workflow that enables smaller open-source models to outperform larger models by addressing inefficiencies in LLM reasoning caused by model verbosity and poor instruction following.
Details
Motivation: Current LLM reasoning relies on extensive test-time computation but suffers from inefficiencies due to model verbosity and poor instruction following, leading to wasted compute resources.
Method: The authors analyze the capability-cost trade-off and introduce CEPO, an optimized reasoning workflow that co-designs orchestration frameworks with underlying model capabilities.
Result: CEPO empowers smaller open-source models to outperform models multiple times their size, demonstrating improved efficiency in reasoning tasks.
Conclusion: The work shows a clear path toward co-designing orchestration frameworks with model capabilities to unlock powerful reasoning in small-to-medium sized models, and the workflow will be open-sourced for further research.
Abstract: Modern LLM reasoning relies on extensive test-time computation, driven by internal model training and external agentic orchestration. However, this synergy is often inefficient, as model verbosity and poor instruction following lead to wasted compute. We analyze this capability-cost trade-off and introduce an optimized reasoning workflow (CEPO) that empowers smaller open-source models to outperform models multiple times their size. We will open-source this workflow to enable further research. Our work demonstrates a clear path toward co-designing orchestration frameworks with the underlying model capabilities to unlock powerful reasoning in small-to-medium sized models.
[294] Agentic Metacognition: Designing a “Self-Aware” Low-Code Agent for Failure Prediction and Human Handoff
Jiexi Xu
Main category: cs.AI
TL;DR: A metacognitive layer for LCNC agents predicts failures and initiates human handoffs to improve reliability and trust.
Details
Motivation: Address reliability challenges in autonomous agents within LCNC environments, where non-deterministic behavior leads to loops, inaccurate outputs, and failures that frustrate users and break trust.
Method: Integration of a secondary metacognitive layer that monitors the primary LCNC agent, predicts impending failures based on triggers like excessive latency or repetitive actions, and proactively initiates human handoffs with transparency.
Result: Empirical analysis shows significant increase in overall task success rate, though with notable computational overhead increase.
Conclusion: Human handoffs should be reframed as a core design feature that enhances resilience, improves user experience, and builds trust through transparency; discusses practical/ethical implications and future research directions.
Abstract: The inherent non-deterministic nature of autonomous agents, particularly within low-code/no-code (LCNC) environments, presents significant reliability challenges. Agents can become trapped in unforeseen loops, generate inaccurate outputs, or encounter unrecoverable failures, leading to user frustration and a breakdown of trust. This report proposes a novel architectural pattern to address these issues: the integration of a secondary, “metacognitive” layer that actively monitors the primary LCNC agent. Inspired by human introspection, this layer is designed to predict impending task failures based on a defined set of triggers, such as excessive latency or repetitive actions. Upon predicting a failure, the metacognitive agent proactively initiates a human handoff, providing the user with a clear summary of the agent’s “thought process” and a detailed explanation of why it could not proceed. An empirical analysis of a prototype system demonstrates that this approach significantly increases the overall task success rate. However, this performance gain comes with a notable increase in computational overhead. The findings reframe human handoffs not as an admission of defeat but as a core design feature that enhances system resilience, improves user experience, and builds trust by providing transparency into the agent’s internal state. The report discusses the practical and ethical implications of this approach and identifies key directions for future research.
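A minimal sketch of the trigger logic such a metacognitive layer might implement; the latency and repetition thresholds are assumed placeholders, not the prototype's actual values:

```python
import time

class MetacognitiveMonitor:
    """Watch a primary agent's step stream and predict impending failure
    from simple triggers (thresholds here are illustrative assumptions)."""
    def __init__(self, max_step_seconds=30.0, max_repeats=3):
        self.max_step_seconds = max_step_seconds
        self.max_repeats = max_repeats
        self.last_time = time.monotonic()
        self.recent_actions = []

    def observe(self, action):
        now = time.monotonic()
        latency = now - self.last_time
        self.last_time = now
        self.recent_actions.append(action)
        repeats = self.recent_actions[-self.max_repeats:]
        if latency > self.max_step_seconds:
            return "handoff: excessive latency"
        if len(repeats) == self.max_repeats and len(set(repeats)) == 1:
            return "handoff: repetitive actions suggest a loop"
        return None  # keep running; a real handoff would include a trace summary
```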
[295] Analysis of approximate linear programming solution to Markov decision problem with log barrier function
Donghwan Lee, Hyukjun Yang, Bum Geun Park
Main category: cs.AI
TL;DR: This paper establishes a theoretical foundation for solving LP-based MDPs using log-barrier functions to transform inequality-constrained optimization into unconstrained optimization solvable via gradient descent.
Details
Motivation: LP-based methods for MDPs have been underused despite recent attention in offline RL, primarily because they lead to challenging inequality-constrained optimization problems compared to Bellman-equation-based methods.
Method: The paper leverages log-barrier functions to transform the LP formulation of MDPs into an unconstrained optimization problem, enabling approximate solutions via gradient descent.
Result: The method provides a practical approach to solving LP-based MDPs, though a thorough theoretical interpretation of this approach has not been previously developed.
Conclusion: This work bridges the theoretical gap by establishing foundations for more effective LP-based MDP solving using log-barrier transformations and gradient descent optimization.
Abstract: There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been less commonly employed, although they have recently gained attention in contexts such as offline RL. The relative underuse of the LP-based methods stems from the fact that it leads to an inequality-constrained optimization problem, which is generally more challenging to solve effectively compared with Bellman-equation-based methods. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear simple, to the best of our knowledge, a thorough theoretical interpretation of this approach has not yet been developed. This paper aims to bridge this gap.
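Concretely, the transformation described here takes the standard primal LP for an MDP and folds its inequality constraints into the objective via a log barrier with parameter $t$; this is the textbook form of the idea, while the paper's precise formulation and analysis are its contribution:

```latex
\min_{v}\; \mu^{\top} v \quad \text{s.t.}\quad v(s) \,\ge\, r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \quad \forall\, s,a
\quad\Longrightarrow\quad
\min_{v}\; \mu^{\top} v \;-\; \frac{1}{t} \sum_{s,a} \log\!\Big( v(s) - r(s,a) - \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \Big)
```

The barrier term is finite only where every constraint holds strictly, so plain gradient descent on the right-hand objective stays feasible while approaching the LP solution as $t$ grows.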
[296] LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation
Huizhen Shu, Xuying Li, Zhuo Li
Main category: cs.AI
TL;DR: LATENTGUARD is a three-stage framework combining behavioral alignment with supervised latent space control for interpretable safety steering in LLMs, achieving selective refusal while preserving utility.
Details
Motivation: Existing approaches struggle to balance comprehensive safety with fine-grained controllability at the representation level in large language models.
Method: Three-stage framework: 1) Fine-tuning on rationalized datasets with reasoning-enhanced responses, 2) Training structured VAE on MLP activations with multi-label supervision, 3) Targeted manipulation of learned latent dimensions for selective refusal.
Result: Experiments on Qwen3-8B show significant improvements in safety controllability and response interpretability without utility loss. Cross-architecture validation on Mistral-7B confirms generalizability.
Conclusion: Structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.
Abstract: Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LATENTGUARD, a novel three-stage framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our approach begins by fine-tuning an LLM on rationalized datasets containing both reasoning-enhanced refusal responses to adversarial prompts and reasoning-enhanced normal responses to benign queries, establishing robust behavioral priors across both safety-critical and utility-preserving scenarios. We then train a structured variational autoencoder (VAE) on intermediate MLP activations, supervised by multi-label annotations including attack types, attack methods, and benign indicators. This supervision enables the VAE to learn disentangled latent representations that capture distinct adversarial characteristics while maintaining semantic interpretability. Through targeted manipulation of learned latent dimensions, LATENTGUARD achieves selective refusal behavior, effectively blocking harmful requests while preserving helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate significant improvements in both safety controllability and response interpretability without compromising utility. Cross-architecture validation on Mistral-7B confirms the generalizability of our latent steering approach, showing consistent effectiveness across different model families. Our results suggest that structured representation-level intervention offers a promising pathway toward building safer yet practical LLM systems.
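A heavily simplified sketch of what a latent intervention of this kind could look like; `vae.encode`, `vae.decode`, and `refusal_dims` are hypothetical names standing in for LatentGuard's actual interface and learned structure:

```python
import torch

def steer_refusal(vae, mlp_activation, refusal_dims, strength=3.0):
    """Illustrative latent steering: encode an intermediate MLP activation,
    push the latent dimensions that supervision associated with harmful-request
    indicators, and decode the patched activation back into the model."""
    with torch.no_grad():
        z = vae.encode(mlp_activation)      # disentangled latent code
        z[..., refusal_dims] += strength    # move along refusal-linked axes
        return vae.decode(z)                # patched activation fed back in
```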
[297] CON-QA: Privacy-Preserving QA using cloud LLMs in Contract Domain
Ajeet Kumar Singh, Rajsabi Surya, Anurag Tripathi, Santanu Choudhury, Sudhir Bisane
Main category: cs.AI
TL;DR: CON-QA is a hybrid privacy-preserving framework for secure question answering over enterprise contracts using both local and cloud-based LLMs, designed to protect sensitive information while maintaining utility.
Details
Motivation: Enterprises face critical challenges in protecting sensitive contractual information (PII and commercial clauses) when using cloud-based LLMs like ChatGPT and Gemini for legal document workflows.
Method: Three-stage framework: (i) semantic query decomposition and chunk retrieval using local LLM, (ii) anonymization via structured one-to-many mapping to prevent entity inference attacks, (iii) cloud-based response generation with local reconstruction using reverse mapping.
Result: Empirical evaluations using CUAD-QA corpus (85k QA pairs over 510 contracts) show CON-QA effectively maintains privacy and utility, preserves answer quality and legal semantics, and significantly mitigates privacy risks.
Conclusion: CON-QA demonstrates practical suitability for secure enterprise-level contract document processing by balancing privacy protection with functional utility.
Abstract: As enterprises increasingly integrate cloud-based large language models (LLMs) such as ChatGPT and Gemini into their legal document workflows, protecting sensitive contractual information - including Personally Identifiable Information (PII) and commercially sensitive clauses - has emerged as a critical challenge. In this work, we propose CON-QA, a hybrid privacy-preserving framework designed specifically for secure question answering over enterprise contracts, effectively combining local and cloud-hosted LLMs. The CON-QA framework operates through three stages: (i) semantic query decomposition and query-aware document chunk retrieval using a locally deployed LLM analysis, (ii) anonymization of detected sensitive entities via a structured one-to-many mapping scheme, ensuring semantic coherence while preventing cross-session entity inference attacks, and (iii) anonymized response generation by a cloud-based LLM, with accurate reconstruction of the original answer locally using a session-consistent many-to-one reverse mapping. To rigorously evaluate CON-QA, we introduce CUAD-QA, a corpus of 85k question-answer pairs generated over 510 real-world CUAD contract documents, encompassing simple, complex, and summarization-style queries. Empirical evaluations, complemented by detailed human assessments, confirm that CON-QA effectively maintains both privacy and utility, preserves answer quality, maintains fidelity to legal clause semantics, and significantly mitigates privacy risks, demonstrating its practical suitability for secure, enterprise-level contract document processing.
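A minimal sketch of the anonymize/reconstruct bookkeeping: placeholders stay consistent within a session, and instantiating a fresh anonymizer per session makes the entity-to-placeholder mapping one-to-many across sessions, as the framework requires. Names and the placeholder format are illustrative:

```python
import itertools

class SessionAnonymizer:
    """Session-scoped entity anonymization with a local reverse map, so the
    cloud LLM never sees the originals and answers are reconstructed locally."""
    def __init__(self):
        self.forward, self.reverse = {}, {}
        self._ids = itertools.count(1)

    def anonymize(self, text, entities):
        for ent in entities:  # entities detected by the local LLM pass
            if ent not in self.forward:
                ph = f"<ENTITY_{next(self._ids)}>"
                self.forward[ent], self.reverse[ph] = ph, ent
            text = text.replace(ent, self.forward[ent])
        return text

    def deanonymize(self, text):
        for ph, ent in self.reverse.items():
            text = text.replace(ph, ent)
        return text
```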
[298] Embodied AI: From LLMs to World Models
Tongtong Feng, Xin Wang, Yu-Gang Jiang, Wenwu Zhu
Main category: cs.AI
TL;DR: This paper provides a comprehensive survey of Embodied Artificial Intelligence (AI), exploring how Large Language Models (LLMs) and World Models (WMs) empower embodied AI systems for achieving Artificial General Intelligence (AGI).
Details
Motivation: To systematically review the evolution of embodied AI, particularly focusing on recent breakthroughs in LLMs and WMs, and to demonstrate their combined potential in enabling complex tasks within physical worlds.
Method: The paper presents a comprehensive literature review covering: history, key technologies, components, and hardware systems of embodied AI; analysis of LLM/MLLM-driven and WM-driven approaches; examination of joint MLLM-WM architectures; and discussion of representative applications.
Result: The survey identifies that LLMs enable semantic reasoning and task decomposition for embodied cognition, while WMs facilitate physical law-compliant interactions through internal world representations and predictions. The joint MLLM-WM architecture shows promise for complex real-world tasks.
Conclusion: Embodied AI represents a crucial paradigm for AGI development, with LLMs and WMs playing complementary roles. Future research should focus on integrating these approaches more effectively and addressing remaining challenges in real-world applications.
Abstract: Embodied Artificial Intelligence (AI) is an intelligent system paradigm for achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications and driving the evolution from cyberspace to physical systems. Recent breakthroughs in Large Language Models (LLMs) and World Models (WMs) have drawn significant attention for embodied AI. On the one hand, LLMs empower embodied AI via semantic reasoning and task decomposition, bringing high-level natural language instructions and low-level natural language actions into embodied cognition. On the other hand, WMs empower embodied AI by building internal representations and future predictions of the external world, facilitating physical law-compliant embodied interactions. As such, this paper comprehensively explores the literature in embodied AI from basics to advances, covering both LLM-driven and WM-driven works. In particular, we first present the history, key technologies, key components, and hardware systems of embodied AI, as well as discuss its development from a unimodal to a multimodal perspective. We then scrutinize the two burgeoning fields of embodied AI, i.e., embodied AI with LLMs/multimodal LLMs (MLLMs) and embodied AI with WMs, meticulously delineating their indispensable roles in end-to-end embodied cognition and physical-law-driven embodied interactions. Building upon the above advances, we further share our insights on the necessity of the joint MLLM-WM driven embodied AI architecture, shedding light on its profound significance in enabling complex tasks within physical worlds. In addition, we examine representative applications of embodied AI, demonstrating its wide applicability in real-world scenarios. Last but not least, we point out future research directions of embodied AI that deserve further investigation.
[299] MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLM
Wenliang Li, Rui Yan, Xu Zhang, Li Chen, Hongji Zhu, Jing Zhao, Junjun Li, Mengru Li, Wei Cao, Zihang Jiang, Wei Wei, Kun Zhang, Shaohua Kevin Zhou
Main category: cs.AI
TL;DR: The paper proposes MACD, a multi-agent framework that enables LLMs to self-learn clinical knowledge through experience accumulation, significantly improving diagnostic accuracy and achieving performance comparable to or exceeding human physicians.
Details
Motivation: Current LLM approaches for clinical diagnosis optimize isolated inferences but neglect the accumulation of reusable clinical experience, which is crucial for developing diagnostic expertise similar to how physicians learn through practice.
Method: MACD uses a multi-agent pipeline where LLM-based diagnostician agents summarize, refine, and apply diagnostic insights through iterative consultations, with human oversight for unresolved cases. The framework includes an evaluator agent and supports MACD-human collaboration.
Result: Evaluated on 4,390 real-world cases across 7 diseases, MACD improved primary diagnostic accuracy by up to 22.3% over established guidelines. It achieved performance on par with or exceeding human physicians (up to 16% improvement) and 18.6% improvement in MACD-human workflow. The system also demonstrated cross-model stability and transferability.
Conclusion: MACD presents a scalable self-learning paradigm that bridges the gap between LLMs’ intrinsic knowledge and real-world clinical practice, generating traceable rationales for enhanced explainability.
Abstract: Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real-world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi-agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi-Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self-learn clinical knowledge via a multi-agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more focused and accurate diagnosis on key disease-specific cues. We further extend it to a MACD-human collaborative workflow, where multiple LLM-based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real-world patient cases across seven diseases using diverse open-source LLMs (Llama-3.1 8B/70B, DeepSeek-R1-Distill-Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains up to 22.3% (MACD). On a subset of the data, it achieves performance on par with or exceeding that of human physicians (up to 16% improvement over physicians-only diagnosis). Additionally, on the MACD-human workflow, it achieves an 18.6% improvement compared to physicians-only diagnosis. Moreover, self-learned knowledge exhibits strong cross-model stability, transferability, and model-specific personalization, while the system can generate traceable rationales, enhancing explainability. Consequently, this work presents a scalable self-learning paradigm for LLM-assisted diagnosis, bridging the gap between the intrinsic knowledge of LLMs and real-world clinical practice.
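A rough sketch of the consult-evaluate-escalate loop described above; the agent and evaluator interfaces are hypothetical stand-ins, not the MACD implementation:

```python
def macd_consult(diagnosticians, evaluator, case, insights, max_rounds=3):
    """Iterative multi-agent consultation: diagnostician agents propose
    diagnoses informed by shared self-learned insights, an evaluator checks
    for agreement, and unresolved cases escalate to human oversight."""
    for _ in range(max_rounds):
        proposals = [agent(case, insights) for agent in diagnosticians]
        verdict = evaluator(case, proposals)
        if verdict.get("agreement"):
            insights.append(verdict["rationale"])  # distill reusable experience
            return verdict["diagnosis"]
    return "escalate_to_physician"
```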
[300] From Pheromones to Policies: Reinforcement Learning for Engineered Biological Swarms
Aymeric Vellinger, Nemanja Antonic, Elio Tuci
Main category: cs.AI
TL;DR: This paper establishes theoretical equivalence between pheromone-mediated aggregation in C. elegans and reinforcement learning, showing how stigmergic signals function as distributed reward mechanisms.
Details
Motivation: To bridge synthetic biology with swarm robotics by demonstrating that stigmergic systems inherently encode distributed reinforcement learning processes, advancing programmable living systems for resilient decision-making.
Method: Modeled engineered nematode swarms performing foraging tasks, showing pheromone dynamics mathematically mirror cross-learning updates. Validated with literature data and computational experiments in multi-armed bandit scenarios.
Result: The model accurately replicates empirical C. elegans foraging patterns. In dynamic environments, persistent pheromone trails hinder adaptation, but introducing exploratory agents restores collective plasticity and enables rapid task switching.
Conclusion: Stigmergic systems encode distributed RL processes where environmental signals act as external memory for collective credit assignment, with behavioral heterogeneity balancing exploration-exploitation trade-offs for swarm-level strategy extinction.
Abstract: Swarm intelligence emerges from decentralised interactions among simple agents, enabling collective problem-solving. This study establishes a theoretical equivalence between pheromone-mediated aggregation in C. elegans and reinforcement learning (RL), demonstrating how stigmergic signals function as distributed reward mechanisms. We model engineered nematode swarms performing foraging tasks, showing that pheromone dynamics mathematically mirror cross-learning updates, a fundamental RL algorithm. Experimental validation with data from literature confirms that our model accurately replicates empirical C. elegans foraging patterns under static conditions. In dynamic environments, persistent pheromone trails create positive feedback loops that hinder adaptation by locking swarms into obsolete choices. Through computational experiments in multi-armed bandit scenarios, we reveal that introducing a minority of exploratory agents insensitive to pheromones restores collective plasticity, enabling rapid task switching. This behavioural heterogeneity balances exploration-exploitation trade-offs, implementing swarm-level extinction of outdated strategies. Our results demonstrate that stigmergic systems inherently encode distributed RL processes, where environmental signals act as external memory for collective credit assignment. By bridging synthetic biology with swarm robotics, this work advances programmable living systems capable of resilient decision-making in volatile environments.
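A minimal bandit sketch of the deposit-plus-evaporation dynamics the paper maps onto cross-learning, including the minority of pheromone-insensitive explorers; all parameter values are illustrative, not fitted to the reported experiments:

```python
import random

def pheromone_bandit(arms, pulls, evaporation=0.1, explorer_prob=0.1):
    """Pheromone-mediated arm choice on a multi-armed bandit. `arms` are
    callables returning a stochastic 0/1 payoff."""
    tau = [1.0] * len(arms)  # pheromone trail per arm
    for _ in range(pulls):
        if random.random() < explorer_prob:   # pheromone-insensitive explorer
            i = random.randrange(len(arms))
        else:                                  # choose proportionally to pheromone
            i = random.choices(range(len(arms)), weights=tau)[0]
        reward = arms[i]()
        tau = [(1 - evaporation) * t for t in tau]  # trails decay
        tau[i] += reward                            # successful foragers deposit
    return tau
```

Setting `explorer_prob=0` reproduces the lock-in failure: once one arm's trail dominates, the swarm keeps choosing it even after its payoff collapses.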
[301] Steerable Adversarial Scenario Generation through Test-Time Preference Alignment
Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Sun, Haotian Shi, Wei Ma, Jian Sun
Main category: cs.AI
TL;DR: SAGE is a steerable adversarial scenario generation framework that enables fine-grained control over the trade-off between adversariality and realism without retraining, using hierarchical group-based preference optimization and weight interpolation.
Details
Motivation: Existing adversarial scenario generation methods are constrained to fixed trade-offs between objectives like adversariality and realism, lacking flexibility for diverse training and testing requirements.
Method: Proposes hierarchical group-based preference optimization to decouple hard constraints from soft preferences, fine-tunes two experts on opposing preferences, and constructs continuous policies via weight interpolation at inference time.
Result: SAGE generates scenarios with superior balance of adversariality and realism, and enables more effective closed-loop training of driving policies.
Conclusion: The framework provides theoretical justification through linear mode connectivity and demonstrates practical effectiveness for flexible adversarial scenario generation in autonomous driving safety assessment.
Abstract: Adversarial scenario generation is a cost-effective approach for safety assessment of autonomous driving systems. However, existing methods are often constrained to a single, fixed trade-off between competing objectives such as adversariality and realism. This yields behavior-specific models that cannot be steered at inference time, lacking the efficiency and flexibility to generate tailored scenarios for diverse training and testing requirements. In view of this, we reframe the task of adversarial scenario generation as a multi-objective preference alignment problem and introduce a new framework named Steerable Adversarial scenario GEnerator (SAGE). SAGE enables fine-grained test-time control over the trade-off between adversariality and realism without any retraining. We first propose hierarchical group-based preference optimization, a data-efficient offline alignment method that learns to balance competing objectives by decoupling hard feasibility constraints from soft preferences. Instead of training a fixed model, SAGE fine-tunes two experts on opposing preferences and constructs a continuous spectrum of policies at inference time by linearly interpolating their weights. We provide theoretical justification for this framework through the lens of linear mode connectivity. Extensive experiments demonstrate that SAGE not only generates scenarios with a superior balance of adversariality and realism but also enables more effective closed-loop training of driving policies. Project page: https://tongnie.github.io/SAGE/.
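The inference-time steering mechanism reduces to interpolating the two experts' weights, which the paper justifies via linear mode connectivity. A minimal PyTorch-style sketch over state dicts, assuming both experts share an architecture:

```python
import torch

def interpolate_policies(adversarial_sd, realism_sd, alpha):
    """Blend two expert checkpoints: alpha=1 recovers the adversarial expert,
    alpha=0 the realism expert; intermediate values steer the trade-off
    without any retraining."""
    return {
        name: alpha * adversarial_sd[name] + (1.0 - alpha) * realism_sd[name]
        for name in adversarial_sd
    }

# policy.load_state_dict(interpolate_policies(adv.state_dict(), real.state_dict(), 0.3))
```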
[302] PEPS: Quantum-Inspired Reinforcement Learning for Coherent Reasoning Traces in LLMs
Venkat Margapuri, Garik Kazanjian, Naren Kosaraju
Main category: cs.AI
TL;DR: A quantum-inspired approach using PEPS fidelity-based reward in PPO improves LLM reasoning coherence across arithmetic, intuitive, and entailment tasks.
Details
Motivation: LLMs struggle with maintaining coherent multi-step reasoning traces that require structured logical flow, needing better methods to enforce global coherence.
Method: Incorporates fidelity-based reward from Projected Entangled Pair States (PEPS) into Proximal Policy Optimization to guide learning through structural consistency rather than direct supervision.
Result: Significant improvements over supervised, contrastive, and pretrained baselines on GSM8K, StrategyQA, and EntailmentBank datasets using coherence-determining metrics.
Conclusion: Quantum-inspired fidelity serves as an effective foundation to improve reasoning trace coherence in LLMs, offering a novel approach for structural consistency.
Abstract: Large Language Models (LLMs) often struggle with maintaining coherent multi-step reasoning traces, particularly in tasks that require a structured logical flow. This work introduces a quantum-inspired approach to address the challenge by incorporating a fidelity-based reward derived from Projected Entangled Pair States (PEPS) into Proximal Policy Optimization. Unlike prior approaches that use direct supervision or contrastive objectives, the proposed method guides learning through structural consistency, offering a novel approach to enforce global coherence in generated reasoning traces. The proposed framework is evaluated using multiple coherence-determining metrics on diverse datasets such as GSM8K, StrategyQA, and EntailmentBank spanning arithmetic, intuitive, and entailment-based reasoning. Results show that the proposed quantum-inspired approach offers significant improvements over supervised, contrastive, and pretrained baseline approaches, highlighting the effectiveness of quantum-inspired fidelity as a foundation to improve reasoning trace coherence in LLMs.
[303] Formal Verification of Minimax Algorithms
Wieger Wesselink, Kees Huizing, Huub van de Wetering
Main category: cs.AI
TL;DR: Formal verification of minimax search algorithms using Dafny, including alpha-beta pruning and transposition tables, with a new witness-based correctness criterion for depth-limited search.
Details
Motivation: To provide formal guarantees of correctness for minimax search algorithms commonly used in game playing and AI applications, ensuring their reliability and correctness through mathematical verification.
Method: Used the Dafny verification system to formally verify various minimax algorithms, introduced a witness-based correctness criterion for depth-limited search with transposition tables, and applied this to two representative algorithms.
Result: Successfully verified a range of minimax search algorithms with formal proofs, including variations with alpha-beta pruning and transposition tables. All verification artifacts and Python implementations are publicly available.
Conclusion: The paper demonstrates that formal verification of complex search algorithms is feasible and provides a rigorous foundation for ensuring algorithm correctness in AI applications.
Abstract: Using the Dafny verification system, we formally verify a range of minimax search algorithms, including variations with alpha-beta pruning and transposition tables. For depth-limited search with transposition tables, we introduce a witness-based correctness criterion and apply it to two representative algorithms. All verification artifacts, including proofs and Python implementations, are publicly available.
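For reference, here is the textbook shape of the algorithm family being proven correct: depth-limited minimax with alpha-beta pruning. This plain Python sketch is not the authors' verified artifact, just the standard algorithm their proofs cover:

```python
def alphabeta(state, depth, alpha, beta, maximizing, children, evaluate):
    """Depth-limited minimax with alpha-beta pruning. `children` enumerates
    successor states; `evaluate` is a heuristic applied at the depth limit."""
    succ = children(state)
    if depth == 0 or not succ:
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in succ:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the opponent will never allow this branch
        return value
    value = float("inf")
    for child in succ:
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True, children, evaluate))
        beta = min(beta, value)
        if alpha >= beta:
            break  # alpha cutoff
    return value
```

The pruning makes correctness non-obvious (skipped subtrees must provably not change the result), which is exactly what the Dafny proofs establish.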
[304] Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI
Lorenzo Giusti, Ole Anton Werner, Riccardo Taiello, Matilde Carvalho Costa, Emre Tosun, Andrea Protani, Marc Molina, Rodrigo Lopes de Almeida, Paolo Cacace, Diogo Reis Santos, Luigi Serio
Main category: cs.AI
TL;DR: Federation of Agents (FoA) is a distributed orchestration framework that enables dynamic, capability-driven collaboration among AI agents through semantic routing, dynamic task decomposition, and smart clustering.
Details
Motivation: To transform static multi-agent coordination into dynamic, capability-driven collaboration by making agent capabilities searchable and enabling efficient task distribution among heterogeneous AI agents.
Method: Uses Versioned Capability Vectors (VCVs) for machine-readable agent profiles, semantic routing with HNSW indices, dynamic task decomposition through consensus-based merging, and smart clustering for collaborative refinement. Built on MQTT publish-subscribe semantics.
Result: Achieves 13x improvements over single-model baselines on HealthBench, with clustering-enhanced collaboration particularly effective for complex reasoning tasks. The system scales horizontally while maintaining consistent performance.
Conclusion: Semantic orchestration with structured collaboration can unlock the collective intelligence of heterogeneous federations of AI agents, demonstrating scalable and efficient multi-agent coordination.
Abstract: We present Federation of Agents (FoA), a distributed orchestration framework that transforms static multi-agent coordination into dynamic, capability-driven collaboration. FoA introduces Versioned Capability Vectors (VCVs): machine-readable profiles that make agent capabilities searchable through semantic embeddings, enabling agents to advertise their capabilities, cost, and limitations. Our architecture combines three key innovations: (1) semantic routing that matches tasks to agents over sharded HNSW indices while enforcing operational constraints through cost-biased optimization, (2) dynamic task decomposition where compatible agents collaboratively break down complex tasks into DAGs of subtasks through consensus-based merging, and (3) smart clustering that groups agents working on similar subtasks into collaborative channels for k-round refinement before synthesis. Built on top of MQTT's publish-subscribe semantics for scalable message passing, FoA achieves sub-linear complexity through hierarchical capability matching and efficient index maintenance. Evaluation on HealthBench shows 13x improvements over single-model baselines, with clustering-enhanced collaboration particularly effective for complex reasoning tasks requiring multiple perspectives. The system scales horizontally while maintaining consistent performance, demonstrating that semantic orchestration with structured collaboration can unlock the collective intelligence of heterogeneous federations of AI agents.
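A minimal sketch of the cost-biased capability matching at the heart of the semantic routing, using brute-force cosine similarity in place of the sharded HNSW index so the scoring rule stays visible; the bias weight is an assumed parameter:

```python
import numpy as np

def route_task(task_embedding, capability_vectors, costs, cost_bias=0.1, k=3):
    """Rank agents by cosine similarity between the task embedding and their
    capability vectors, penalized by each agent's advertised cost."""
    caps = np.asarray(capability_vectors, dtype=float)
    sims = caps @ task_embedding / (
        np.linalg.norm(caps, axis=1) * np.linalg.norm(task_embedding) + 1e-12
    )
    scores = sims - cost_bias * np.asarray(costs, dtype=float)
    return np.argsort(-scores)[:k]  # indices of the k best-matching agents
```

A production deployment would replace the brute-force similarity with an approximate nearest-neighbor query, as the paper's HNSW sharding does.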
[305] Design Insights and Comparative Evaluation of a Hardware-Based Cooperative Perception Architecture for Lane Change Prediction
Mohamed Manzour, Catherine M. Elias, Omar M. Shehata, Rubén Izquierdo, Miguel Ángel Sotelo
Main category: cs.AI
TL;DR: This paper presents a real-world deployment of cooperative lane-change prediction in mixed traffic, highlighting practical challenges and lessons learned that are often under-documented in simulation-based studies.
Details
Motivation: Most lane-change prediction research relies on simulations or pre-recorded datasets with simplified assumptions that don't hold in practice. Real-world deployments are rare and their practical challenges are under-documented.
Method: The study explores cooperative lane-change prediction through actual hardware deployment in mixed traffic conditions, documenting implementation and testing experiences.
Result: The research identified practical challenges including bottlenecks, reliability issues, and operational constraints that significantly shaped system behavior.
Conclusion: By documenting real-world deployment experiences, this study provides valuable guidance for others working on similar lane-change prediction pipelines, addressing gaps in practical implementation knowledge.
Abstract: Research on lane change prediction has gained attention in the last few years. Most existing works in this area have been conducted in simulation environments or with pre-recorded datasets; these works often rely on simplified assumptions about sensing, communication, and traffic behavior that do not always hold in practice. Real-world deployments of lane-change prediction systems are relatively rare, and when they are reported, the practical challenges, limitations, and lessons learned are often under-documented. This study explores cooperative lane-change prediction through a real hardware deployment in mixed traffic and shares the insights that emerged during implementation and testing. We highlight the practical challenges we faced, including bottlenecks, reliability issues, and operational constraints that shaped the behavior of the system. By documenting these experiences, the study provides guidance for others working on similar pipelines.
[306] Scan-do Attitude: Towards Autonomous CT Protocol Management using a Large Language Model Agent
Xingjian Kang, Linda Vorberg, Andreas Maier, Alexander Katzmann, Oliver Taubmann
Main category: cs.AI
TL;DR: A Large Language Model (LLM)-based agent framework is proposed to assist with CT scan protocol management, enabling natural language interpretation and execution of protocol configuration requests to improve workflow efficiency.
Details
Motivation: Managing CT scan protocols is time-consuming and requires clinical/technical expertise, while there's an increasing shortage of skilled workforce in radiology.
Method: The agent combines in-context learning, instruction following, and structured tool-calling abilities to identify relevant protocol elements and apply accurate modifications.
Result: Experimental results show the agent can effectively retrieve protocol components, generate device-compatible protocol definition files, and faithfully implement user requests.
Conclusion: The approach demonstrates feasibility for LLM-based agents in CT protocol management, though faces limitations regarding device API unification and handling ambiguous/complex requests.
Abstract: Managing scan protocols in Computed Tomography (CT), which includes adjusting acquisition parameters or configuring reconstructions, as well as selecting postprocessing tools in a patient-specific manner, is time-consuming and requires clinical as well as technical expertise. At the same time, we observe an increasing shortage of skilled workforce in radiology. To address this issue, a Large Language Model (LLM)-based agent framework is proposed to assist with the interpretation and execution of protocol configuration requests given in natural language or a structured, device-independent format, aiming to improve workflow efficiency and reduce technologists' workload. The agent combines in-context learning, instruction following, and structured tool-calling abilities to identify relevant protocol elements and apply accurate modifications. In a systematic evaluation, experimental results indicate that the agent can effectively retrieve protocol components, generate device-compatible protocol definition files, and faithfully implement user requests. Despite demonstrating feasibility in principle, the approach faces limitations regarding syntactic and semantic validity due to the lack of a unified device API, and challenges with ambiguous or complex requests. In summary, the findings show a clear path towards LLM-based agents for supporting scan protocol management in CT imaging.
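As a rough illustration of the structured tool-calling pattern such an agent relies on, the sketch below parses a JSON tool call and applies it to a toy protocol dictionary; the tool name, schema, and protocol fields are invented for illustration and do not reflect any vendor's device API.

```python
import json

# Toy structured tool call applied to a toy protocol; the tool name, schema,
# and protocol fields are invented and do not reflect any real device API.
PROTOCOL = {"kernel": "Br40", "kV": 120, "slice_thickness_mm": 3.0}

def set_protocol_param(name: str, value):
    """Apply one validated modification to the active protocol."""
    if name not in PROTOCOL:
        raise KeyError(f"unknown protocol element: {name}")
    PROTOCOL[name] = value
    return PROTOCOL

# What an LLM's tool call for "use a 1 mm reconstruction" might look like:
call = json.loads('{"tool": "set_protocol_param",'
                  ' "args": {"name": "slice_thickness_mm", "value": 1.0}}')
print(set_protocol_param(**call["args"]))
```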
[307] Tree Search for Language Model Agents
Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
Main category: cs.AI
TL;DR: Proposes an inference-time search algorithm for LM agents to perform exploration and multi-step planning in web environments, achieving significant performance improvements on VisualWebArena and WebArena benchmarks.
Details
Motivation: Language models struggle with multi-step reasoning, planning, and using environmental feedback for realistic computer tasks, limiting their effectiveness as autonomous agents in web automation.
Method: A best-first tree search algorithm that operates within the actual environment space, designed to be complementary with existing state-of-the-art agents and specifically effective for web tasks.
Result: 39.7% relative increase in success rate on VisualWebArena (26.4% SOTA) and 28.0% relative improvement on WebArena (19.2% success rate), with performance scaling with increased test-time compute.
Conclusion: Search algorithms significantly improve LM agent performance on web tasks, demonstrating effectiveness of explicit exploration and planning, with promising directions for future work.
Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at https://jykoh.com/search-agents.
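The following is a minimal, generic best-first tree search of the kind the paper builds on, with `expand` and `value` as stand-ins for the LM agent's action proposer and state evaluator; the toy string task is purely illustrative.

```python
import heapq
import itertools

def best_first_search(root, expand, value, budget=50):
    """Generic best-first tree search: repeatedly expand the highest-value
    frontier state. `expand` and `value` stand in for the LM agent's action
    proposer and state evaluator; `budget` caps environment steps."""
    counter = itertools.count()          # tie-breaker so heapq never compares states
    frontier = [(-value(root), next(counter), root)]
    best = root
    for _ in range(budget):
        if not frontier:
            break
        neg_v, _, state = heapq.heappop(frontier)
        if -neg_v > value(best):
            best = state
        for child in expand(state):
            heapq.heappush(frontier, (-value(child), next(counter), child))
    return best

# Toy usage: search for a string of 1s; value = count of 1s, depth capped at 5.
expand = lambda s: [s + "0", s + "1"] if len(s) < 5 else []
value = lambda s: s.count("1")
print(best_first_search("", expand, value))  # -> "11111"
```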
[308] Reinforcement Learning and Machine Ethics: A Systematic Review
Ajay Vishwanath, Louise A. Dennis, Marija Slavkovik
Main category: cs.AI
TL;DR: A systematic review of reinforcement learning applications in machine ethics, focusing on ethics specifications, RL components/frameworks, and environments used to achieve ethical behavior in autonomous systems.
Details
Motivation: Previous systematic reviews of machine ethics excluded reinforcement learning approaches, creating a gap in the state of the art. Recent years have seen increased use of RL in machine ethics studies, necessitating this comprehensive review.
Method: Conducted a systematic review of reinforcement learning for machine ethics and machine ethics within reinforcement learning, analyzing trends in ethics specifications, RL components/frameworks, and environments.
Result: The review consolidates work in machine ethics and reinforcement learning, completing the gap in the current machine ethics landscape by including RL-based approaches that were previously excluded.
Conclusion: This systematic review provides a comprehensive overview of RL applications in machine ethics, addressing the previously identified gap and highlighting current trends and approaches in the field.
Abstract: Machine ethics is the field that studies how ethical behaviour can be accomplished by autonomous systems. While there exist some systematic reviews aiming to consolidate the state of the art in machine ethics prior to 2020, these tend not to include work that uses reinforcement learning agents as entities whose ethical behaviour is to be achieved. The reason for this is that only in recent years have we witnessed an increase in machine ethics studies within reinforcement learning. We present here a systematic review of reinforcement learning for machine ethics and machine ethics within reinforcement learning. Additionally, we highlight trends in terms of ethics specifications, components and frameworks of reinforcement learning, and environments used to produce ethical behaviour. Our systematic review aims to consolidate the work in machine ethics and reinforcement learning, thus filling the gap in the state-of-the-art machine ethics landscape.
[309] Multi-Agents are Social Groups: Investigating Social Influence of Multiple Agents in Human-Agent Interactions
Tianqi Song, Yugin Tan, Zicheng Zhu, Yibin Feng, Yi-Chieh Lee
Main category: cs.AI
TL;DR: Multi-agent AI systems can create social pressure similar to human group influence, causing users to shift opinions more effectively than single-agent systems.
Details
Motivation: To investigate whether groups of AI agents can create social pressure that influences users' opinions, drawing inspiration from human group social influence phenomena.
Method: Conducted a study where participants discussed social issues with either single or multiple AI agents, with agents either agreeing or disagreeing with users’ stances on topics.
Result: Multiple agents increased social pressure and caused greater opinion shifts toward agents’ stances compared to single agents, even when conversation content was kept constant.
Conclusion: Multi-agent systems have advantages over single-agent platforms for opinion change, with potential applications for social good but also risks of malicious manipulation of public opinion.
Abstract: Multi-agent systems - systems with multiple independent AI agents working together to achieve a common goal - are becoming increasingly prevalent in daily life. Drawing inspiration from the phenomenon of human group social influence, we investigate whether a group of AI agents can create social pressure on users to agree with them, potentially changing their stance on a topic. We conducted a study in which participants discussed social issues with either a single or multiple AI agents, and where the agents either agreed or disagreed with the user’s stance on the topic. We found that conversing with multiple agents (holding conversation content constant) increased the social pressure felt by participants, and caused a greater shift in opinion towards the agents’ stances on each topic. Our study shows the potential advantages of multi-agent systems over single-agent platforms in causing opinion change. We discuss design implications for possible multi-agent systems that promote social good, as well as the potential for malicious actors to use these systems to manipulate public opinion.
[310] Enhancing Crash Frequency Modeling Based on Augmented Multi-Type Data by Hybrid VAE-Diffusion-Based Generative Neural Networks
Junlan Chen, Qijie He, Pei Liu, Wei Ma, Ziyuan Pu, Nan Zheng
Main category: cs.AI
TL;DR: Proposes a hybrid VAE-Diffusion neural network to address excessive zero observations in crash frequency modeling, outperforming traditional statistical methods.
Details
Motivation: Inaccurate crash frequency predictions due to excessive zero observations can lead to misguided policies and wasted resources, jeopardizing traffic safety. Existing approaches have limitations like restrictive assumptions or information loss.
Method: A hybrid VAE-Diffusion neural network designed to reduce zero observations and handle multi-type tabular crash data (count, ordinal, nominal, and real-valued variables). Synthetic data quality is assessed through similarity, accuracy, diversity, and structural consistency metrics.
Result: The hybrid VAE-Diffusion model outperforms baseline models across all metrics, offering more effective crash data augmentation and improved prediction accuracy.
Conclusion: The study demonstrates the potential of synthetic data generated by the hybrid VAE-Diffusion model to enhance traffic safety by improving crash frequency modeling and informing better policy decisions.
Abstract: Crash frequency modelling analyzes the impact of factors like traffic volume, road geometry, and environmental conditions on crash occurrences. Inaccurate predictions can distort our understanding of these factors, leading to misguided policies and wasted resources, which jeopardize traffic safety. A key challenge in crash frequency modelling is the prevalence of excessive zero observations, caused by underreporting, the low probability of crashes, and high data collection costs. These zero observations often reduce model accuracy and introduce bias, complicating safety decision making. While existing approaches, such as statistical methods, data aggregation, and resampling, attempt to address this issue, they either rely on restrictive assumptions or result in significant information loss, distorting crash data. To overcome these limitations, we propose a hybrid VAE-Diffusion neural network, designed to reduce zero observations and handle the complexities of multi-type tabular crash data (count, ordinal, nominal, and real-valued variables). We assess the synthetic data quality generated by this model through metrics like similarity, accuracy, diversity, and structural consistency, and compare its predictive performance against traditional statistical models. Our findings demonstrate that the hybrid VAE-Diffusion model outperforms baseline models across all metrics, offering a more effective approach to augmenting crash data and improving the accuracy of crash frequency predictions. This study highlights the potential of synthetic data to enhance traffic safety by improving crash frequency modelling and informing better policy decisions.
[311] STRIVE: Structured Reasoning for Self-Improvement in Claim Verification
Haisong Gong, Jing Li, Junfei Wu, Qiang Liu, Shu Wu, Liang Wang
Main category: cs.AI
TL;DR: STRIVE introduces structured reasoning with Claim Decomposition, Entity Analysis, and Evidence Grounding Verification to improve self-improvement methods for claim verification, achieving 31.4% performance gain over base models.
Details
Motivation: Standard self-improvement methods struggle in claim verification because low-quality reasoning chains may falsely match binary truth labels, introducing faulty reasoning into training and degrading performance.
Method: STRIVE uses a structured reasoning design with three components: Claim Decomposition, Entity Analysis, and Evidence Grounding Verification. It begins with warm-up fine-tuning on annotated examples, then generates reasoning chains and selects only correct and structurally sound ones for self-improvement training.
Result: STRIVE achieves significant improvements with 31.4% performance gain over the base model and 20.7% over Chain of Thought on the HOVER datasets.
Conclusion: The structured reasoning approach effectively addresses limitations of standard self-improvement methods in claim verification by improving reasoning quality and providing better supervision signals.
Abstract: Claim verification is the task of determining whether a claim is supported or refuted by evidence. Self-improvement methods, where reasoning chains are generated and those leading to correct results are selected for training, have succeeded in tasks like mathematical problem solving. However, in claim verification, this approach struggles. Low-quality reasoning chains may falsely match binary truth labels, introducing faulty reasoning into the self-improvement process and ultimately degrading performance. To address this, we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our method introduces a structured reasoning design with Claim Decomposition, Entity Analysis, and Evidence Grounding Verification. These components improve reasoning quality, reduce errors, and provide additional supervision signals for self-improvement. STRIVE begins with a warm-up phase, where the base model is fine-tuned on a small number of annotated examples to learn the structured reasoning design. It is then applied to generate reasoning chains for all training examples, selecting only those that are correct and structurally sound for subsequent self-improvement training. We demonstrate that STRIVE achieves significant improvements over baseline models, with a 31.4% performance gain over the base model and 20.7% over Chain of Thought on the HOVER datasets, highlighting its effectiveness.
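A minimal sketch of the selection step, assuming chains are plain strings and structural soundness means all three stage headers are present; the actual STRIVE checks are richer than this.

```python
# Hypothetical chain filter in the spirit of STRIVE: keep only generations
# that (a) reach the gold label and (b) contain all three structured stages.
REQUIRED_STAGES = ("Claim Decomposition", "Entity Analysis",
                   "Evidence Grounding Verification")

def structurally_sound(chain: str) -> bool:
    return all(stage in chain for stage in REQUIRED_STAGES)

def select_for_self_improvement(generations, gold_label):
    """generations: list of (reasoning_chain, predicted_label) pairs."""
    return [chain for chain, pred in generations
            if pred == gold_label and structurally_sound(chain)]

demo = [("Claim Decomposition: ... Entity Analysis: ... "
         "Evidence Grounding Verification: ...", "SUPPORTED"),
        ("the claim is true because it sounds right", "SUPPORTED")]
print(len(select_for_self_improvement(demo, "SUPPORTED")))  # 1: only the structured chain
```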
[312] CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications
Anton Alyakin, Jaden Stryker, Daniel Alexander Alber, Karl L. Sangwon, Jin Vivian Lee, Brandon Duderstadt, Akshay Save, David Kurland, Spencer Frome, Shrutika Singh, Jeff Zhang, Eunice Yang, Ki Yun Park, Cordelia Orillac, Aly A. Valliani, Sean Neifert, Albert Liu, Aneek Patel, Christopher Livia, Darryl Lau, Ilya Laufer, Peter A. Rozman, Eveline Teresa Hidalgo, Howard Riina, Rui Feng, Todd Hollon, Yindalon Aphinyanaphongs, John G. Golfinos, Laura Snyder, Eric Leuthardt, Douglas Kondziolka, Eric Karl Oermann
Main category: cs.AI
TL;DR: CNS-Obsidian is a specialized neurosurgical VLM trained on curated peer-reviewed literature that matches GPT-4o on synthetic questions but underperforms on human-generated questions and clinical deployment.
Details
Motivation: General-purpose VLMs trained on uncurated internet data have limitations for high-stakes medical decision-making like neurosurgery, necessitating domain-specific models trained on curated scientific literature.
Method: Compiled 23,984 neurosurgical articles to create 263,064 training samples, then fine-tuned the LLaVA-Next model. Conducted a blinded, randomized trial comparing CNS-Obsidian vs GPT-4o as diagnostic co-pilots in real neurosurgical consultations.
Result: CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%) but significantly underperformed on human-generated questions (46.81% vs 65.70%). In clinical deployment, CNS-Obsidian received positive ratings in 40.62% vs 57.89% for GPT-4o, though both included correct diagnosis in ~60% of cases.
Conclusion: Domain-specific VLMs can approach frontier model performance in specialized medical domains despite being smaller and cheaper, but low clinical utilization suggests chatbot interfaces may not align with specialist workflows, indicating need for alternative AI integration strategies.
Abstract: General-purpose vision-language models (VLMs) demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed neurosurgical literature, and demonstrate its clinical utility compared with GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these image-text pairs into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion-parameter LLaVA-Next model. In a blinded, randomized deployment trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgeons were assigned to use either CNS-Obsidian or GPT-4o as a diagnostic co-pilot after patient consultations. Primary outcomes were diagnostic helpfulness and accuracy. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o’s 65.70% (p<10^-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults. CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance in specialized medical domains despite being orders of magnitude smaller and less expensive to train. However, low clinical utilization suggests chatbot interfaces may not align with specialist workflows, indicating a need for alternative AI integration strategies.
[313] AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents
Jiahui Sun, Zhichao Hua, Yubin Xia
Main category: cs.AI
TL;DR: AutoEval is an automated evaluation framework for mobile agents that eliminates manual effort by using UI state change representation and a Judge System to generate reward signals and conduct autonomous evaluation.
Details
Motivation: Existing benchmarks for mobile agents lack practicality and scalability due to extensive manual effort required for defining task reward signals and implementing evaluation codes.
Method: The approach uses a UI state change representation to automatically generate task reward signals and employs a Judge System for autonomous evaluation without human intervention.
Result: AutoEval achieves high correlation with human-annotated signals and up to 94% accuracy in autonomous evaluation, comparable to human evaluation. It successfully evaluates state-of-the-art mobile agents.
Conclusion: AutoEval provides an effective automated evaluation framework that advances mobile agent development by eliminating manual evaluation efforts while maintaining high accuracy comparable to human evaluation.
Abstract: Comprehensive evaluation of mobile agents can significantly advance their development and real-world applicability. However, existing benchmarks lack practicality and scalability due to the extensive manual effort in defining task reward signals and implementing evaluation codes. We propose AutoEval, an evaluation framework which tests mobile agents without any manual effort. Our approach designs a UI state change representation which can be used to automatically generate task reward signals, and employs a Judge System for autonomous evaluation. Evaluation shows AutoEval can automatically generate reward signals with high correlation to human-annotated signals, and achieve high accuracy (up to 94%) in autonomous evaluation comparable to human evaluation. Finally, we evaluate state-of-the-art mobile agents using our framework, providing insights into their performance and limitations.
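To illustrate the idea of rewards derived from UI state changes, here is a toy sketch in which a task is a set of expected key-value changes between two UI snapshots; the snapshot format and scoring rule are assumptions, not AutoEval's actual representation.

```python
# Illustrative reward from UI state changes, not AutoEval's actual format:
# a task is modeled as a set of key-value conditions the final UI must satisfy.
def ui_diff(before: dict, after: dict) -> dict:
    """Return the elements whose state changed between two UI snapshots."""
    return {k: v for k, v in after.items() if before.get(k) != v}

def reward(before, after, expected_changes):
    changed = ui_diff(before, after)
    hits = sum(changed.get(k) == v for k, v in expected_changes.items())
    return hits / len(expected_changes)

before = {"wifi": "off", "screen": "home"}
after = {"wifi": "on", "screen": "settings"}
print(reward(before, after, {"wifi": "on"}))  # 1.0 -> task satisfied
```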
[314] Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models
Ruibin Xiong, Yimeng Chen, Dmitrii Khizbullin, Mingchen Zhuge, Jürgen Schmidhuber
Main category: cs.AI
TL;DR: WriteHERE is a general agent framework for long-form writing that achieves human-like adaptive writing through recursive task decomposition and dynamic integration of retrieval, reasoning, and composition tasks.
Details
Motivation: Current approaches rely on predefined workflows and rigid thinking patterns for writing, resulting in constrained adaptability during the writing process.
Method: Proposes a planning mechanism that interleaves recursive task decomposition and execution, and integrates three fundamental task types (retrieval, reasoning, composition) to facilitate heterogeneous task decomposition.
Result: Outperforms state-of-the-art approaches across all automatic evaluation metrics in both fiction writing and technical report generation tasks.
Conclusion: The framework demonstrates effectiveness and broad applicability, with code and prompts publicly released to facilitate further research.
Abstract: Long-form writing agents require flexible integration and interaction across information retrieval, reasoning, and composition. Current approaches rely on predefined workflows and rigid thinking patterns to generate outlines before writing, resulting in constrained adaptability during writing. In this paper we propose WriteHERE, a general agent framework that achieves human-like adaptive writing through recursive task decomposition and dynamic integration of three fundamental task types: retrieval, reasoning, and composition. Our methodology features: 1) a planning mechanism that interleaves recursive task decomposition and execution, eliminating artificial restrictions on writing workflow; and 2) integration of task types that facilitates heterogeneous task decomposition. Evaluations on both fiction writing and technical report generation show that our method consistently outperforms state-of-the-art approaches across all automatic evaluation metrics, demonstrating the effectiveness and broad applicability of our proposed framework. We have publicly released our code and prompts to facilitate further research.
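A toy sketch of heterogeneous recursive decomposition, where a task tree mixes the three task types and execution interleaves with recursion; the task schema and leaf executors are invented for illustration.

```python
# Toy recursion over the three task types WriteHERE integrates; the task
# schema and the leaf "executor" are assumptions for illustration only.
def execute(task, depth=0):
    kind = task["type"]                  # "retrieval" | "reasoning" | "composition"
    if "subtasks" in task:               # interleave decomposition and execution
        parts = [execute(t, depth + 1) for t in task["subtasks"]]
        return " ".join(parts)
    return f"<{kind}:{task['goal']}>"    # leaf: would call the relevant tool/LM

report = {"type": "composition", "goal": "write report", "subtasks": [
    {"type": "retrieval", "goal": "gather sources"},
    {"type": "reasoning", "goal": "outline argument"},
    {"type": "composition", "goal": "draft sections"},
]}
print(execute(report))
```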
[315] Online Language Splatting
Saimouli Katragadda, Cho-Ying Wu, Yuliang Guo, Xinyu Huang, Guoquan Huang, Liu Ren
Main category: cs.AI
TL;DR: Online Language Splatting is the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3D Gaussian Splatting SLAM system without requiring pre-generated language features, outperforming offline methods in accuracy while achieving 40x efficiency boost.
Details
Motivation: To enable AI agents to interact seamlessly with humans and 3D environments by aligning human language with 3D spatial representations in real-time, overcoming limitations of prior offline approaches that require computationally intensive preprocessing.
Method: The design combines (1) a high-resolution CLIP embedding module generating language features in 18 ms per frame, (2) a two-stage online auto-encoder compressing 768-D CLIP features to 15-D while preserving open-vocabulary capabilities, and (3) color-language disentangled optimization for improved rendering quality.
Result: Experimental results show the online method surpasses state-of-the-art offline methods in accuracy while achieving more than 40x efficiency boost.
Conclusion: The framework demonstrates potential for dynamic and interactive AI applications by enabling efficient fusion of high-dimensional language features into 3D representations while balancing computation speed, memory usage, rendering quality and open-vocabulary capability.
Abstract: To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (GS), these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing the computation speed, memory usage, rendering quality and open-vocabulary capability. To this end, we innovatively design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18ms per frame, (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities, and (3) a color-language disentangled optimization approach to improve rendering quality. Experimental results show that our online method not only surpasses the state-of-the-art offline methods in accuracy but also achieves more than 40x efficiency boost, demonstrating the potential for dynamic and interactive AI applications.
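The compression component can be pictured with a small PyTorch sketch that squeezes 768-D CLIP features down to 15 dimensions via reconstruction; layer sizes and the offline training loop are assumptions, whereas the paper's auto-encoder is two-stage and trained online.

```python
import torch
import torch.nn as nn

# Minimal sketch: compress 768-D CLIP features to 15-D and reconstruct them.
# The paper's auto-encoder is two-stage and trained online; this is offline.
encoder = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 15))
decoder = nn.Sequential(nn.Linear(15, 128), nn.ReLU(), nn.Linear(128, 768))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

features = torch.randn(256, 768)          # stand-in for per-pixel CLIP features
for step in range(100):
    recon = decoder(encoder(features))
    loss = nn.functional.mse_loss(recon, features)
    opt.zero_grad()
    loss.backward()
    opt.step()

codes = encoder(features)                  # 15-D codes stored on the Gaussians
print(codes.shape)                         # torch.Size([256, 15])
```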
[316] Exploring Explainable Multi-agent MCTS-minimax Hybrids in Board Game Using Process Mining
Yiyu Qian, Tim Miller, Zheng Qian, Liyuan Zhao
Main category: cs.AI
TL;DR: This paper investigates explanations for MCTS decision-making behavior and addresses its weakness of missing crucial moves by integrating shallow minimax search into rollout phases, using process mining to analyze strategies in 3v3 checkers.
Details
Motivation: MCTS agents are difficult to understand due to large, complex search trees from simulating many possible futures. The research aims to explain MCTS decision-making behavior and overcome its weakness of being highly selective and missing important moves.
Method: Integrate shallow minimax search into the rollout phase of multi-agent MCTS and apply process mining techniques to analyze and explain agents’ strategies in 3v3 checkers.
Result: The approach helps address MCTS’s tactical weaknesses by combining it with full-width minimax search, providing better explanations of agent behavior through process mining analysis.
Conclusion: Combining MCTS with minimax search and process mining offers improved understanding and performance for sequential decision-making problems, particularly in multi-agent environments like 3v3 checkers.
Abstract: Monte-Carlo Tree Search (MCTS) is a family of sampling-based search algorithms widely used for online planning in sequential decision-making domains and at the heart of many recent advances in artificial intelligence. Understanding the behavior of MCTS agents is difficult for developers and users due to the frequently large and complex search trees that result from the simulation of many possible futures, their evaluations, and their relationships. This paper presents our ongoing investigation into potential explanations for the decision-making and behavior of MCTS. A weakness of MCTS is that it constructs a highly selective tree and, as a result, can miss crucial moves and fall into tactical traps. Full-width minimax search constitutes the solution. We integrate shallow minimax search into the rollout phase of multi-agent MCTS and use process mining techniques to explain agents’ strategies in 3v3 checkers.
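The hybrid's core idea, replacing random rollout moves with a shallow minimax choice, can be sketched on a toy subtraction game; the game and search depth are stand-ins for the checkers setting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Nim:
    """Tiny stand-in game: take 1-3 stones; taking the last stone wins."""
    stones: int
    to_move: int = 1  # +1 = the MCTS agent, -1 = the opponent

    def moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]

    def apply(self, n):
        return Nim(self.stones - n, -self.to_move)

    def is_terminal(self):
        return self.stones == 0

    def evaluate(self):
        # The player who just moved took the last stone and wins.
        return -self.to_move if self.is_terminal() else 0

def minimax(state, depth, maximizing):
    if depth == 0 or state.is_terminal():
        return state.evaluate()
    vals = [minimax(state.apply(m), depth - 1, not maximizing)
            for m in state.moves()]
    return max(vals) if maximizing else min(vals)

def rollout_move(state, depth=3):
    """Replace the random rollout move with a shallow minimax choice,
    avoiding the tactical traps a purely random playout can fall into."""
    return max(state.moves(),
               key=lambda m: minimax(state.apply(m), depth - 1, maximizing=False))

print(rollout_move(Nim(5)))  # 1: leaves 4 stones, a losing position for the opponent
```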
[317] Closed-loop control of seizure activity via real-time seizure forecasting by reservoir neuromorphic computing
Maryam Sadeghi, Darío Fernández Khatiboun, Yasser Rezaeiyan, Saima Rizwan, Alessandro Barcellona, Andrea Merello, Marco Crepaldi, Gabriella Panuccio, Farshad Moradi
Main category: cs.AI
TL;DR: A neuromorphic reservoir computing system for real-time seizure forecasting and personalized stimulation in drug-resistant epilepsy, achieving 83.33% forecasting accuracy and >97% seizure reduction with low-frequency stimulation.
Details
Motivation: Current closed-loop brain stimulation for epilepsy has limitations: stimulation is delivered to abort seizures rather than prevent them, and parameter tuning requires lengthy trial-and-error processes that delay therapeutic efficacy.
Method: Developed a neuromorphic reservoir computing hardware system that performs real-time personalized free-run stimulations based on seizure forecasting. Each forecast triggers an electrical pulse instead of fixed-frequency stimulus trains. Validated using hippocampal spheroids coupled to 3D microelectrode arrays.
Result: The system achieved 83.33% accuracy in forecasting seizures during training and >97% seizure reduction during real-time processing, using instantaneous stimulation frequencies primarily within 20 Hz (much lower than typical clinical practice).
Conclusion: Neuromorphic systems show potential as next-generation neuromodulation strategy for personalized drug-resistant epilepsy treatment, leveraging their sparse and event-driven processing for real-time applications.
Abstract: Closed-loop brain stimulation holds potential as personalized treatment for drug-resistant epilepsy (DRE) but still suffers from limitations that result in highly variable efficacy. First, stimulation is typically delivered upon detection of the seizure to abort rather than prevent it; second, the stimulation parameters are established by trial and error, requiring lengthy rounds of fine-tuning, which delay steady-state therapeutic efficacy. Here, we address these limitations by leveraging the potential of neuromorphic computing. We present a neuromorphic reservoir computing hardware system capable of driving real-time personalized free-run stimulations based on seizure forecasting, wherein each forecast triggers an electrical pulse rather than an arbitrarily predefined fixed-frequency stimulus train. The system achieves 83.33% accuracy in forecasting seizure occurrences during the training phase. We validate the system using hippocampal spheroids coupled to 3D microelectrode array as a simplified testbed, achieving seizure reduction >97% during the real-time processing while primarily using instantaneous stimulation frequencies within 20 Hz, well below what typically used in clinical practice. Our work demonstrates the potential of neuromorphic systems as a next-generation neuromodulation strategy for personalized DRE treatment, leveraging their sparse and event-driven processing for real-time applications.
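A software sketch of the forecast-then-pulse loop using a tiny echo state network (a common reservoir computing model); the reservoir size, readout, threshold, and synthetic signal are all assumptions, and the actual system runs on neuromorphic hardware against hippocampal recordings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy echo state network: a fixed random reservoir with a (here untrained)
# readout; in practice the readout would be fit on labeled seizure data.
n = 100
W_in = rng.normal(scale=0.5, size=(n, 1))
W = rng.normal(size=(n, n))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()   # spectral radius < 1
w_out = rng.normal(scale=0.1, size=n)           # stand-in readout weights

state = np.zeros(n)
max_risk, pulses = 0.0, 0
for t, sample in enumerate(rng.normal(size=300)):   # stand-in field potential
    state = np.tanh(W @ state + W_in[:, 0] * sample)
    risk = 1 / (1 + np.exp(-w_out @ state))         # forecast probability
    max_risk = max(max_risk, risk)
    if risk > 0.9:                                  # forecast -> one pulse
        pulses += 1
print(f"pulses delivered: {pulses}, peak forecast risk: {max_risk:.2f}")
```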
[318] Weaver: Interweaving SQL and LLM for Table Reasoning
Rohit Khoja, Devanshu Gupta, Yanjie Fu, Dan Roth, Vivek Gupta
Main category: cs.AI
TL;DR: Weaver is a modular pipeline that dynamically integrates SQL and LLMs for table-based question answering, outperforming state-of-the-art methods by decomposing complex queries into manageable subtasks.
Details
Motivation: Traditional SQL struggles with unstructured data in tables, while LLMs face limitations with long input sequences. Existing SQL-LLM approaches use rigid workflows that limit adaptability to complex queries.
Method: Weaver generates flexible step-by-step plans combining SQL for structured data retrieval with LLMs for semantic processing, decomposing complex queries into manageable subtasks.
Result: Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates.
Conclusion: The modular pipeline approach of dynamically integrating SQL and LLMs improves accuracy and generalization for table-based question answering tasks.
Abstract: Querying tables with unstructured data is challenging due to the presence of text (or image), either embedded in the table or in external paragraphs, which traditional SQL struggles to process, especially for tasks requiring semantic reasoning. While Large Language Models (LLMs) excel at understanding context, they face limitations with long input sequences. Existing approaches that combine SQL and LLMs typically rely on rigid, predefined workflows, limiting their adaptability to complex queries. To address these issues, we introduce Weaver, a modular pipeline that dynamically integrates SQL and LLMs for table-based question answering (TableQA). Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. By decomposing complex queries into manageable subtasks, Weaver improves accuracy and generalization. Our experiments show that Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates. The code, along with other associated scripts, is available at https://coral-lab-asu.github.io/weaver.
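A toy interleaving of a SQL retrieval step with an LLM semantic-filter step over an in-memory SQLite table; the plan format and the `llm` stub are illustrative assumptions, not Weaver's interface.

```python
import sqlite3

# Toy interleaving of SQL retrieval and LLM semantic steps; the plan format
# and the `llm` stub are illustrative assumptions, not Weaver's interface.
def llm(prompt: str) -> str:
    return "yes" if "Paris" in prompt else "no"   # stand-in for a model call

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cities (name TEXT, note TEXT)")
con.executemany("INSERT INTO cities VALUES (?, ?)",
                [("Paris", "capital of France"), ("Lyon", "on the Rhone")])

plan = [
    ("sql", "SELECT name, note FROM cities"),               # structured retrieval
    ("llm", "Is {name} a national capital? Note: {note}"),  # semantic filtering
]

rows = con.execute(plan[0][1]).fetchall()
answers = [name for name, note in rows
           if llm(plan[1][1].format(name=name, note=note)) == "yes"]
print(answers)  # ['Paris']
```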
[319] Emergent Risk Awareness in Rational Agents under Resource Constraints
Daniel Jarne Ornia, Nicholas Bishop, Joel Dyer, Wei-Chen Lee, Ani Calinescu, Doyne Farmer, Michael Wooldridge
Main category: cs.AI
TL;DR: This paper analyzes how AI agents operating under survival constraints (resource/failure limitations) develop emergent behaviors that can misalign with human objectives, and proposes mitigation mechanisms.
Details
Motivation: AI agents deployed in resource-constrained environments face implicit trade-offs that reshape their utility-driven behavior, potentially creating misalignment with human principals due to asymmetries in constraint exposure.
Method: The authors formalize the problem using a survival bandit framework, provide theoretical and empirical analysis of survival-driven preference shifts, and identify conditions for misalignment emergence.
Result: The research quantifies how survival pressure leads to risk-seeking or risk-averse behaviors in AI agents and establishes conditions under which misalignment with human objectives occurs.
Conclusion: This work provides guidelines for safely deploying AI systems in critical resource-limited environments by increasing understanding of emergent behaviors under survival pressure.
Abstract: Advanced reasoning models with agentic capabilities (AI agents) are deployed to interact with humans and to solve sequential decision-making problems under (approximate) utility functions and internal models. When such problems have resource or failure constraints where action sequences may be forcibly terminated once resources are exhausted, agents face implicit trade-offs that reshape their utility-driven (rational) behaviour. Additionally, since these agents are typically commissioned by a human principal to act on their behalf, asymmetries in constraint exposure can give rise to previously unanticipated misalignment between human objectives and agent incentives. We formalise this setting through a survival bandit framework, provide theoretical and empirical results that quantify the impact of survival-driven preference shifts, identify conditions under which misalignment emerges and propose mechanisms to mitigate the emergence of risk-seeking or risk-averse behaviours. As a result, this work aims to increase understanding and interpretability of emergent behaviours of AI agents operating under such survival pressure, and offer guidelines for safely deploying such AI systems in critical resource-limited environments.
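The survival effect can be reproduced in a few lines: two arms with identical expected reward, where a budget that can hit zero makes the myopically equivalent risky arm strictly worse; payoffs, budget, and the hedging policy are illustrative assumptions, not the paper's formal setup.

```python
import random

random.seed(0)

# Two arms with equal mean reward: a safe arm and a risky arm that can
# exhaust the budget. Payoff values and the myopic policy are assumptions.
def pull(arm):
    return 1.0 if arm == "safe" else random.choice([4.0, -2.0])  # both mean 1.0

def run(policy, budget=5.0, horizon=50):
    total = 0.0
    for _ in range(horizon):
        r = pull(policy(budget))
        budget += r
        total += r
        if budget <= 0:          # resources exhausted: forcibly terminated
            break
    return total

always_risky = lambda b: "risky"
survival_aware = lambda b: "risky" if b > 6.0 else "safe"  # hedge near ruin

trials = 2000
print(sum(run(always_risky) for _ in range(trials)) / trials)
print(sum(run(survival_aware) for _ in range(trials)) / trials)  # higher on average
```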
[320] Compression Strategies for Efficient Multimodal LLMs in Medical Contexts
Tanvir A. Khan, Aranya Saha, Ismam N. Swapnil, Mohammad A. Haque
Main category: cs.AI
TL;DR: This paper proposes a novel compression method for medical MLLMs that combines structural pruning with activation-aware quantization, enabling 7B parameter models to run in 4GB VRAM with 70% memory reduction and 4% performance improvement.
Details
Motivation: Multimodal Large Language Models (MLLMs) have great potential in medical applications but face high computational costs that require efficient compression techniques to make them practical for deployment.
Method: The authors propose a novel layer selection method for structural pruning, analyze different quantization techniques, and evaluate performance trade-offs in a prune-SFT-quantize pipeline applied to a fine-tuned LLAVA model for medical applications.
Result: The proposed method enables MLLMs with 7B parameters to run within 4 GB of VRAM, reducing memory usage by 70% while achieving 4% higher model performance compared to traditional pruning and quantization techniques at the same compression ratio.
Conclusion: The novel compression approach combining structural pruning with activation-aware quantization effectively addresses the computational challenges of medical MLLMs, making them more practical for real-world deployment with significant memory savings and improved performance.
Abstract: Multimodal Large Language Models (MLLMs) hold huge potential for usage in the medical domain, but their computational costs necessitate efficient compression techniques. This paper evaluates the impact of structural pruning and activation-aware quantization on a fine-tuned LLAVA model for medical applications. We propose a novel layer selection method for pruning, analyze different quantization techniques, and assess the performance trade-offs in a prune-SFT-quantize pipeline. Our proposed method enables MLLMs with 7B parameters to run within 4 GB of VRAM, reducing memory usage by 70% while achieving 4% higher model performance compared to traditional pruning and quantization techniques at the same compression ratio.
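A rough sketch of importance-based structural pruning: score each transformer layer on a calibration batch by how much it changes its input, then drop the least influential ones. The scoring rule is an assumption, not the paper's layer-selection method, which additionally feeds into SFT and quantization stages.

```python
import torch
import torch.nn as nn

# Illustrative structural pruning: drop the layers whose removal changes the
# hidden states least. The scoring rule is an assumption for illustration.
layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4,
                                                   batch_first=True)
                        for _ in range(6)])
layers.eval()
x = torch.randn(2, 10, 64)                 # calibration batch

scores = []
with torch.no_grad():
    h = x
    for i, layer in enumerate(layers):
        out = layer(h)
        # Layers that barely transform their input are cheap to remove.
        scores.append((torch.norm(out - h).item(), i))
        h = out

keep = sorted(i for _, i in sorted(scores)[2:])   # prune the 2 most redundant
pruned = nn.ModuleList(layers[i] for i in keep)
print(f"kept layers: {keep}")
```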
[321] Plan Verification for LLM-Based Embodied Task Completion Agents
Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur
Main category: cs.AI
TL;DR: An iterative verification framework using LLMs to critique and refine noisy task plans in embodied AI, improving trajectory quality while preserving error-recovery patterns.
Details
Motivation: LLM-based task plans and human demonstrations for embodied AI often contain noisy actions, redundant navigation, and logical errors that reduce policy quality.
Method: Proposes an iterative framework with a Judge LLM that critiques action sequences and a Planner LLM that applies revisions, using natural language prompting for broad generalization across error types.
Result: Achieves up to 90% recall and 100% precision on TEACh dataset across four LLMs, with 96.5% of sequences converging in ≤3 iterations while improving temporal efficiency and spatial organization.
Conclusion: Establishes plan verification as a reliable LLM capability for spatial planning, providing a scalable path to higher-quality training data for imitation learning in embodied AI.
Abstract: Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
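A minimal judge/planner loop with rule-based stand-ins for the two LLMs, removing redundant consecutive actions until the judge has no complaints; the critique and revision rules are toy assumptions.

```python
# Minimal judge/planner refinement loop; the critique and revision rules
# are toy stand-ins for the paper's LLM prompts.
def judge(plan):
    """Return indices of redundant consecutive actions."""
    return [i for i in range(1, len(plan)) if plan[i] == plan[i - 1]]

def planner(plan, issues):
    return [a for i, a in enumerate(plan) if i not in set(issues)]

plan = ["goto sink", "goto sink", "pick up mug", "pick up mug", "pour water"]
for _ in range(3):          # 96.5% of sequences converge in at most 3 rounds
    issues = judge(plan)
    if not issues:
        break
    plan = planner(plan, issues)
print(plan)  # ['goto sink', 'pick up mug', 'pour water']
```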
[322] CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models
Zhuofan Chen, Jiyuan He, Yichi Zhang, Xing Hu, Haoxing Wen, Jun Bai, Wenge Rong
Main category: cs.AI
TL;DR: CogAtom is a cognitive atom-based framework for generating high-quality, diverse math problems by recombining fundamental reasoning units extracted from human solutions, enabling scalable problem synthesis with controllable difficulty.
Details
Motivation: The scarcity of Olympiad-level math problems limits test-time scaling techniques for LLMs, which need challenging problems to improve mathematical reasoning capabilities.
Method: Models problem construction as selecting and recombining cognitive atoms (fundamental reasoning units) using a diversity-promoting random walk algorithm and constraint-based recombination to ensure logical soundness.
Result: CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that match AIME difficulty while exceeding it in structural variation.
Conclusion: Provides a cognitively grounded pathway for scalable, high-quality math problem generation, addressing the bottleneck in Olympiad-level problem availability for LLM testing.
Abstract: Mathematical reasoning poses significant challenges for Large Language Models (LLMs) due to its demand for multi-step reasoning and abstract conceptual integration. While recent test-time scaling techniques rely heavily on high-quality, challenging problems, the scarcity of Olympiad-level math problems remains a bottleneck. We introduce CogAtom, a novel cognitive atom-based framework for synthesizing mathematically rigorous and cognitively diverse problems. Unlike prior approaches, CogAtom models problem construction as a process of selecting and recombining fundamental reasoning units, cognitive atoms, extracted from human-authored solutions. A diversity-promoting random walk algorithm enables exploration of the cognitive atom space, while a constraint-based recombination mechanism ensures logical soundness and structural validity. The combinatorial nature of the graph structure provides a near-infinite space of reasoning paths, and the walk algorithm systematically explores this space to achieve large-scale synthesis of high-quality problems; meanwhile, by controlling the number of cognitive atoms, we can precisely adjust problem difficulty, ensuring diversity, scalability, and controllability of the generated problems. Experimental results demonstrate that CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match the difficulty of AIME while exceeding it in structural variation. Our work offers a cognitively grounded pathway toward scalable, high-quality math problem generation. Our code is publicly available at https://github.com/Icarus-1111/CogAtom.
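A toy diversity-promoting random walk over a small graph of cognitive atoms, where previously visited atoms are down-weighted and the number of atoms sets the difficulty; the graph and weighting rule are illustrative assumptions.

```python
import random

random.seed(7)

# Toy graph of "cognitive atoms" (reasoning units); edges and the
# down-weighting rule are illustrative assumptions.
graph = {"AM-GM": ["Cauchy-Schwarz", "substitution"],
         "Cauchy-Schwarz": ["AM-GM", "telescoping"],
         "substitution": ["telescoping", "AM-GM"],
         "telescoping": ["Cauchy-Schwarz", "substitution"]}

def walk(start, n_atoms):
    """Each extra atom raises difficulty; visited atoms are down-weighted
    so the walk favors unexplored reasoning units (diversity promotion)."""
    visits = {a: 0 for a in graph}
    path, node = [start], start
    visits[start] += 1
    while len(path) < n_atoms:
        nbrs = graph[node]
        weights = [1.0 / (1 + visits[n]) for n in nbrs]
        node = random.choices(nbrs, weights=weights)[0]
        visits[node] += 1
        path.append(node)
    return path

print(walk("AM-GM", 4))   # atom set for one synthesized problem
```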
[323] Similarity Field Theory: A General Mathematical Framework for Intelligence
Kei-Sing Ng
Main category: cs.AI
TL;DR: Similarity Field Theory is a mathematical framework that formalizes similarity relations and their evolution, providing a structural basis for dynamic systems and a generative definition of intelligence.
Details
Motivation: To establish a foundational mathematical framework for understanding similarity relations in dynamic systems and formalize what constitutes intelligent behavior through similarity preservation.
Method: Defines similarity fields over entities, system evolution sequences, concepts as fibers, and generative operators. Proves theorems on asymmetry and stability constraints.
Result: Developed Similarity Field Theory with mathematical formalizations, proved key theorems about system constraints, and applied the framework to analyze large language models as probes of societal cognition.
Conclusion: The theory provides a foundational language for characterizing intelligent systems, with applications to understanding AI systems like large language models through similarity-based analysis.
Abstract: We posit that persisting and transforming similarity relations form the structural basis of any comprehensible dynamic system. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field $S: U \times U \to [0,1]$ over a universe of entities $U$, satisfying reflexivity $S(E,E)=1$ and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence $Z_p = (X_p, S^{(p)})$ indexed by $p=0,1,2,\ldots$; (3) concepts $K$ as entities that induce fibers $F_{\alpha}(K) = \{ E \in U \mid S(E,K) \ge \alpha \}$, i.e., superlevel sets of the unary map $S_K(E) := S(E,K)$; and (4) a generative operator $G$ that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator $G$ is intelligent with respect to a concept $K$ if, given a system containing entities belonging to the fiber of $K$, it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability requires either an anchor coordinate or eventual confinement within a level set. These results ensure that the evolution of similarity fields is both constrained and interpretable, culminating in an exploration of how the framework allows us to interpret large language models and present empirical results using large language models as experimental probes of societal cognition.
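A tiny numeric check of the definitions: a similarity field `S` over a toy universe, a fiber as a superlevel set, and a generator `G` that is intelligent with respect to a concept when it maps fiber members back into the fiber; all values are toy assumptions chosen only to exercise the definitions.

```python
# Toy similarity field S over a three-entity universe; values are assumptions.
S = {("cat", "pet"): 0.9, ("dog", "pet"): 0.8, ("rock", "pet"): 0.1}

def fiber(K, alpha, universe):
    """F_alpha(K) = { E in U : S(E, K) >= alpha }, with S(E,K)=0 if undefined."""
    return {E for E in universe if S.get((E, K), 0.0) >= alpha}

def G(entity):
    """Toy generative operator that swaps cat and dog."""
    return {"cat": "dog", "dog": "cat"}.get(entity, "rock")

universe = ["cat", "dog", "rock"]
F = fiber("pet", 0.5, universe)
# G is intelligent w.r.t. "pet" (at alpha=0.5) iff it maps fiber members to
# new entities that remain in the fiber.
print(F, all(G(E) in F for E in F))   # {'cat', 'dog'} True
```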
[324] MAPO: Mixed Advantage Policy Optimization
Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao
Main category: cs.AI
TL;DR: Proposes Mixed Advantage Policy Optimization (MAPO) to address advantage reversion and mirror problems in GRPO by dynamically reweighting advantage functions based on trajectory certainty.
Details
Motivation: Existing GRPO methods suffer from advantage reversion and advantage mirror problems that hinder reasonable advantage allocation across query samples.
Method: Introduces advantage percent deviation for high-certainty trajectories and dynamically reweights the advantage function based on trajectory certainty to adapt to sample-specific characteristics.
Result: Comparison with state-of-the-art methods and ablation studies validate the effectiveness of MAPO approach.
Conclusion: MAPO provides an effective solution to the advantage allocation problems in GRPO through dynamic advantage reweighting based on trajectory certainty.
Abstract: Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
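The reweighting idea can be sketched for a group of binary rollout rewards: blend the standard GRPO advantage with a percent-deviation term in proportion to group certainty. The exact blending formula below is an illustrative assumption, not MAPO's published rule.

```python
import numpy as np

# Sketch of certainty-aware advantage shaping in a GRPO-style update; the
# blending formula is an illustrative assumption, not MAPO's exact rule.
def group_advantages(rewards, w_certainty=0.5):
    r = np.asarray(rewards, dtype=float)           # binary rollout rewards
    adv = (r - r.mean()) / (r.std() + 1e-8)        # standard GRPO advantage
    certainty = abs(r.mean() - 0.5) * 2            # 1 when all rollouts agree
    pct_dev = (r - r.mean()) / (abs(r.mean()) + 1e-8)  # advantage percent deviation
    return (1 - w_certainty * certainty) * adv + (w_certainty * certainty) * pct_dev

print(group_advantages([1, 1, 1, 0]))   # high-certainty group: mixed advantage
print(group_advantages([1, 0, 1, 0]))   # uncertain group: close to plain GRPO
```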
cs.SD
[325] MusiCRS: Benchmarking Audio-Centric Conversational Recommendation
Rohan Surana, Amit Namburi, Gagan Mundada, Abhay Lal, Zachary Novack, Julian McAuley, Junda Wu
Main category: cs.SD
TL;DR: MusiCRS is the first benchmark for audio-centric conversational music recommendation that links real Reddit conversations with audio tracks, revealing current systems’ limitations in audio reasoning.
Details
Motivation: Music recommendation requires reasoning over audio content beyond text/metadata, but current LLM-based systems struggle with nuanced audio understanding and cross-modal integration.
Method: Created a benchmark with 477 conversations spanning diverse music genres, 3,589 musical entities, and audio grounding via YouTube links. Evaluated systems across three input modalities: audio-only, query-only, and multimodal.
Result: Current systems rely heavily on textual signals and struggle with audio reasoning, exposing limitations in cross-modal knowledge integration where models understand dialogue but cannot ground musical concepts in audio.
Conclusion: The MusiCRS benchmark, dataset, and evaluation code are released to facilitate progress in audio-centric conversational recommendation systems.
Abstract: Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain where effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation that links authentic user conversations from Reddit with corresponding audio tracks. MusiCRS contains 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz) with 3,589 unique musical entities and audio grounding via YouTube links. MusiCRS enables evaluation across three input modality configurations: audio-only, query-only, and audio+query (multimodal), allowing systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems rely heavily on textual signals and struggle with nuanced audio reasoning. This exposes fundamental limitations in cross-modal knowledge integration where models excel at dialogue semantics but cannot effectively ground abstract musical concepts in actual audio content. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.
[326] ArtiFree: Detecting and Reducing Generative Artifacts in Diffusion-based Speech Enhancement
Bhawana Chhaglani, Yang Gao, Julius Richter, Xilin Li, Syavosh Zadissa, Tarun Pruthi, Andrew Lovitt
Main category: cs.SD
TL;DR: This paper studies artifact prediction and reduction in diffusion-based speech enhancement, proposing an ensemble inference method using semantic consistency to reduce phonetic errors and improve speech quality.
Details
Motivation: Diffusion-based speech enhancement achieves natural-sounding speech but suffers from generative artifacts and high inference latency, which this work aims to address.
Method: The authors use variance in speech embeddings to predict phonetic errors and propose an ensemble inference method guided by semantic consistency across multiple diffusion runs, with adaptive diffusion steps to balance performance and latency.
Result: The proposed method reduces WER by 15% in low-SNR conditions, effectively improving phonetic accuracy and semantic plausibility while balancing artifact suppression and latency.
Conclusion: Semantic priors are a powerful tool to guide generative speech enhancement toward artifact-free outputs, with ensemble inference and adaptive diffusion steps providing effective solutions to current limitations.
Abstract: Diffusion-based speech enhancement (SE) achieves natural-sounding speech and strong generalization, yet suffers from key limitations like generative artifacts and high inference latency. In this work, we systematically study artifact prediction and reduction in diffusion-based SE. We show that variance in speech embeddings can be used to predict phonetic errors during inference. Building on these findings, we propose an ensemble inference method guided by semantic consistency across multiple diffusion runs. This technique reduces WER by 15% in low-SNR conditions, effectively improving phonetic accuracy and semantic plausibility. Finally, we analyze the effect of the number of diffusion steps, showing that adaptive diffusion steps balance artifact suppression and latency. Our findings highlight semantic priors as a powerful tool to guide generative SE toward artifact-free outputs.
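A sketch of the ensemble selection step: embed several diffusion outputs, measure the spread around their consensus, and keep the most consistent run; the `embed` stub and random candidates are stand-ins for a real semantic encoder and actual enhancer outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the ensemble idea: run the diffusion enhancer several times,
# embed each output, and keep the run closest to the ensemble consensus.
# `embed` and the candidate audio are stand-ins for a real semantic encoder.
def embed(audio):
    return audio[:16]                    # hypothetical semantic embedding

candidates = [rng.normal(size=16000) for _ in range(5)]   # 5 diffusion runs
embs = np.stack([embed(c) for c in candidates])
centroid = embs.mean(axis=0)

# High variance around the centroid flags likely phonetic errors; the most
# consensus-consistent run is selected as the (least artifact-prone) output.
dists = np.linalg.norm(embs - centroid, axis=1)
best = candidates[int(np.argmin(dists))]
print(f"selected run {int(np.argmin(dists))}, distances: {np.round(dists, 2)}")
```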
[327] Thinking While Listening: Simple Test Time Scaling For Audio Classification
Prateek Verma, Mert Pilanci
Main category: cs.SD
TL;DR: A framework that enables neural models to ’think while listening’ to everyday sounds, improving audio classification through reasoning capabilities inspired by large language models.
Details
Motivation: To enhance audio classification performance by incorporating reasoning capabilities similar to those in large language models, addressing how to integrate thinking into audio pipelines and design new architectures that support both thinking and test-time scaling.
Method: Proposes two approaches: incorporating thinking into existing audio classification pipelines for reasoning in category space, and designing new architectures from scratch that support both thinking and test-time scaling. Also evaluates lightweight approaches like retraining only the embedding matrix of frozen smaller models.
Result: Models exhibit improved classification accuracy in both settings. Test-time scaling shows consistent gains as the number of sampled traces increases. Lightweight approaches (retraining only the embedding matrix of a frozen GPT-2) can surpass the performance of billion-parameter text-based reasoning models.
Conclusion: The framework successfully enables neural models to ’think while listening’ to everyday sounds, demonstrating that reasoning capabilities can be effectively incorporated into audio classification systems, with lightweight approaches showing competitive performance against larger models.
Abstract: We propose a framework that enables neural models to “think while listening” to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach, retraining only the embedding matrix of a frozen, smaller model like GPT-2, can surpass the performance of billion-parameter text-based reasoning models.
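Test-time scaling in its simplest form, majority-voting over sampled traces, can be simulated in a few lines; the per-trace accuracy and label set are invented for illustration.

```python
from collections import Counter
import random

random.seed(3)

# Sketch of test-time scaling for audio classification: sample several
# reasoning traces and majority-vote the predicted category. The sampler
# is a stand-in for a model decoding with temperature.
def sample_trace(true_label="dog_bark", accuracy=0.6):
    labels = ["dog_bark", "siren", "rain"]
    return true_label if random.random() < accuracy else random.choice(labels)

for n in (1, 5, 25):
    votes = Counter(sample_trace() for _ in range(n))
    print(n, votes.most_common(1)[0])   # the vote sharpens as traces increase
```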
[328] Can Audio Large Language Models Verify Speaker Identity?
Yiming Ren, Xuenan Xu, Baoxiang Li, Shuai Wang, Chao Zhang
Main category: cs.SD
TL;DR: This paper adapts Audio Large Language Models (ALLMs) for speaker verification by reformulating it as an audio question-answering task, addresses zero-shot limitations through supervised fine-tuning with hard pair sampling, and extends to text-dependent speaker verification.
Details
Motivation: To explore the potential of ALLMs as a unified model for robust speaker verification systems while maintaining general audio understanding capabilities, addressing the limitations of current ALLMs in zero-shot speaker verification under diverse acoustic conditions.
Method: Reformulate speaker verification as audio question-answering, conduct zero-shot evaluations, perform supervised fine-tuning with a rule-based hard pair sampling strategy for challenging training pairs, and extend to text-dependent speaker verification by joint verification of speaker identity and spoken content.
Result: Zero-shot evaluations show limited SV capability in ALLMs; lightweight fine-tuning substantially improves performance but still lags behind conventional models; text-dependent SV yields competitive results with cascaded ASR-SV systems.
Conclusion: With proper adaptation, ALLMs hold substantial potential as unified models for robust speaker verification systems while maintaining general audio understanding capabilities, though there remains a performance gap with conventional approaches.
Abstract: This paper investigates adapting Audio Large Language Models (ALLMs) for speaker verification (SV). We reformulate SV as an audio question-answering task and conduct comprehensive zero-shot evaluations on public benchmarks, showing that current ALLMs have limited zero-shot SV capability and often struggle in diverse acoustic conditions. To address this challenge, we perform supervised fine-tuning on speaker verification data. A rule-based hard pair sampling strategy is proposed to construct more challenging training pairs. Lightweight fine-tuning substantially improves the performance, though there is still a gap between ALLMs and conventional models. Then, we extend to text-dependent SV by jointly querying ALLMs to verify speaker identity and spoken content, yielding results competitive with cascaded ASR-SV systems. Our findings demonstrate that with proper adaptation, ALLMs hold substantial potential as a unified model for robust speaker verification systems, while maintaining the general audio understanding capabilities.
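The rule-based hard pair sampling lends itself to a small sketch. Assuming precomputed speaker embeddings and cosine similarity as the difficulty signal (an assumption; the paper's exact rules are not given in this summary), hard negatives are the most similar different-speaker pairs and hard positives the least similar same-speaker pairs:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sample_hard_pairs(embs, speakers, n_pairs=4):
    """Hard negatives: different-speaker pairs with the HIGHEST similarity.
    Hard positives: same-speaker pairs with the LOWEST similarity."""
    negs, poss = [], []
    n = len(embs)
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(embs[i], embs[j])
            (poss if speakers[i] == speakers[j] else negs).append((s, i, j))
    hard_neg = sorted(negs, reverse=True)[:n_pairs]   # most confusable impostors
    hard_pos = sorted(poss)[:n_pairs]                 # least obvious targets
    return hard_pos, hard_neg

rng = np.random.default_rng(0)
embs = rng.normal(size=(8, 16))
speakers = ["a", "a", "a", "b", "b", "c", "c", "c"]
print(sample_hard_pairs(embs, speakers, n_pairs=2))
```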
[329] Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
Yang Cui, Peter Pan, Lei He, Sheng Zhao
Main category: cs.SD
TL;DR: PKDMark is a lightweight deep learning-based speech watermarking method that uses progressive knowledge distillation to achieve 93.6% computational cost reduction while maintaining high robustness and imperceptibility.
Details
Motivation: Unauthorized voice cloning poses privacy and security risks. Current speech watermarking methods face trade-offs: DSP-based methods are efficient but vulnerable to attacks, while deep learning-based methods are robust but computationally expensive.
Method: Two-stage approach: (1) train a high-performance teacher model using an invertible neural network architecture, (2) transfer capabilities to a compact student model through progressive knowledge distillation.
Result: The distilled model achieves 99.6% average detection F1 score with PESQ of 4.30 in advanced distortions, enabling efficient real-time speech synthesis applications.
Conclusion: PKDMark successfully bridges the gap between computational efficiency and robustness in speech watermarking, making it suitable for practical real-time applications.
Abstract: With the rapid advancement of speech generative models, unauthorized voice cloning poses significant privacy and security risks. Speech watermarking offers a viable solution for tracing sources and preventing misuse. Current watermarking technologies fall mainly into two categories: DSP-based methods and deep learning-based methods. DSP-based methods are efficient but vulnerable to attacks, whereas deep learning-based methods offer robust protection at the expense of significantly higher computational cost. To improve computational efficiency and enhance robustness, we propose PKDMark, a lightweight deep learning-based speech watermarking method that leverages progressive knowledge distillation (PKD). Our approach proceeds in two stages: (1) training a high-performance teacher model using an invertible neural network-based architecture, and (2) transferring the teacher’s capabilities to a compact student model through progressive knowledge distillation. This process reduces computational costs by 93.6% while maintaining a high level of robustness and imperceptibility. Experimental results demonstrate that our distilled model achieves an average detection F1 score of 99.6% with a PESQ of 4.30 under advanced distortions, enabling efficient speech watermarking for real-time speech synthesis applications.
[330] Eliminating Stability Hallucinations in LLM-Based TTS Models via Attention Guidance
ShiMing Wang, ZhiHao Du, Yang Xiang, TianYu Zhao, Han Zhao, Qian Chen, XianGang Li, HanJie Guo, ZhenHua Ling
Main category: cs.SD
TL;DR: This paper proposes methods to reduce stability hallucinations in LLM-based TTS models by improving attention mechanisms, using Optimal Alignment Score (OAS) and chain-of-thought guidance.
Details
Motivation: To address stability hallucinations (repetitive or omitted speech) in LLM-based Text-to-Speech systems by enhancing the attention alignment between text and speech tokens.
Method: 1) Analyzed text-speech alignment mechanisms in LLMs; 2) Proposed Optimal Alignment Score (OAS) using Viterbi algorithm to evaluate alignment quality; 3) Integrated OAS into CosyVoice2 training; 4) Used pre-trained attention values with chain-of-thought guidance to train student model.
Result: Experiments on Seed-TTS-Eval and CV3-Eval test sets showed effective reduction of stability hallucinations in CosyVoice2 without negative side effects.
Conclusion: The proposed methods successfully mitigate stability hallucinations in LLM-based TTS systems through improved attention mechanisms and alignment evaluation.
Abstract: This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. Subsequently, OAS was integrated into the training of CosyVoice2 to assist LLMs in learning continuous, stable alignment. Additionally, the pre-trained attention value is employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods can effectively reduce the stability hallucinations of CosyVoice2 without introducing additional negative effects. The appendix is available at https://wsmzzz.github.io/llm_attn.
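A minimal sketch of a Viterbi-style alignment score over an attention matrix, in the spirit of OAS: the best monotonic path (stay on the current text token or advance by one per speech frame) is scored by its mean log-attention. The exact transition constraints and normalization in the paper may differ.

```python
import numpy as np

def optimal_alignment_score(attn):
    """Viterbi over a (speech_frames x text_tokens) attention matrix.
    The path starts at token 0, ends at the last token, and moves
    monotonically; returns the best path's mean log-attention as an
    alignment-quality score (higher = cleaner alignment)."""
    T, N = attn.shape
    log_a = np.log(attn + 1e-9)
    dp = np.full((T, N), -np.inf)
    dp[0, 0] = log_a[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]
            move = dp[t - 1, n - 1] if n > 0 else -np.inf
            dp[t, n] = max(stay, move) + log_a[t, n]
    return dp[-1, -1] / T

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(5), size=40)  # 40 frames over 5 text tokens
print(optimal_alignment_score(attn))
```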
[331] SEA-Spoof: Bridging The Gap in Multilingual Audio Deepfake Detection for South-East Asian
Jinyang Wu, Nana Hou, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal
Main category: cs.SD
TL;DR: SEA-Spoof is the first large-scale Audio Deepfake Detection dataset for South-East Asian languages, addressing the performance gap in detection models when applied to SEA languages due to data scarcity and language-specific characteristics.
Details
Motivation: The rapid growth of South-East Asia’s digital economy has increased audio deepfake risks, but current datasets poorly cover SEA languages, causing detection models trained on high-resource languages to fail when applied to SEA languages.
Method: Created SEA-Spoof dataset with 300+ hours of paired real and spoof speech across Tamil, Hindi, Thai, Indonesian, Malay, and Vietnamese, using diverse state-of-the-art open-source and commercial synthesis systems to capture wide variability.
Result: Benchmarking showed severe cross-lingual degradation of detection models, but fine-tuning on SEA-Spoof dramatically restored performance across languages and synthesis sources.
Conclusion: The results highlight the urgent need for SEA-focused research and establish SEA-Spoof as a foundation for developing robust, cross-lingual, and fraud-resilient detection systems.
Abstract: The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this critical region. This omission is critical: detection models trained on high-resource languages collapse when applied to SEA, due to mismatches in synthesis quality, language-specific characteristics, and data scarcity. To close this gap, we present SEA-Spoof, the first large-scale Audio Deepfake Detection (ADD) dataset especially for SEA languages. SEA-Spoof spans 300+ hours of paired real and spoof speech across Tamil, Hindi, Thai, Indonesian, Malay, and Vietnamese. Spoof samples are generated from a diverse mix of state-of-the-art open-source and commercial systems, capturing wide variability in style and fidelity. Benchmarking state-of-the-art detection models reveals severe cross-lingual degradation, but fine-tuning on SEA-Spoof dramatically restores performance across languages and synthesis sources. These results highlight the urgent need for SEA-focused research and establish SEA-Spoof as a foundation for developing robust, cross-lingual, and fraud-resilient detection systems.
[332] CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance
Junchuan Zhao, Wei Zeng, Tianle Lyu, Ye Wang
Main category: cs.SD
TL;DR: CoMelSinger is a zero-shot singing voice synthesis framework that enables structured melody control while preventing prosody leakage through contrastive learning and singing voice transcription.
Details
Motivation: Direct extension of discrete codec-based speech synthesis techniques to SVS is challenging due to precise melody control requirements and prosody leakage issues where pitch information gets entangled in timbre prompts.
Method: Built on the non-autoregressive MaskGCT architecture, CoMelSinger uses lyric and pitch tokens instead of text inputs. It employs coarse-to-fine contrastive learning to suppress prosody leakage and incorporates a lightweight encoder-only Singing Voice Transcription module for frame-level supervision.
Result: Experimental results show CoMelSinger achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines.
Conclusion: CoMelSinger successfully enables structured and disentangled melody control in zero-shot SVS while maintaining in-context generalization capabilities.
Abstract: Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extending these techniques to SVS remains non-trivial due to the requirement for precise melody control. In particular, prompt-based generation often introduces prosody leakage, where pitch information is inadvertently entangled within the timbre prompt, compromising controllability. We present CoMelSinger, a zero-shot SVS framework that enables structured and disentangled melody control within a discrete codec modeling paradigm. Built on the non-autoregressive MaskGCT architecture, CoMelSinger replaces conventional text inputs with lyric and pitch tokens, preserving in-context generalization while enhancing melody conditioning. To suppress prosody leakage, we propose a coarse-to-fine contrastive learning strategy that explicitly regularizes pitch redundancy between the acoustic prompt and melody input. Furthermore, we incorporate a lightweight encoder-only Singing Voice Transcription (SVT) module to align acoustic tokens with pitch and duration, offering fine-grained frame-level supervision. Experimental results demonstrate that CoMelSinger achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines.
[333] Enabling Multi-Species Bird Classification on Low-Power Bioacoustic Loggers
Stefano Ciapponi, Leonardo Mannini, Jarek Scanferla, Matteo Anderle, Elisabetta Farella
Main category: cs.SD
TL;DR: WrenNet is an efficient neural network for real-time multi-species bird audio classification on low-power microcontrollers, achieving high accuracy with minimal energy consumption.
Details
Motivation: To enable scalable biodiversity monitoring through continuous, multi-species acoustic monitoring on low-power edge devices, addressing the need for practical and energy-efficient solutions in ecological research.
Method: Proposes a semi-learnable spectral feature extractor optimized for avian vocalizations, deployed on microcontrollers like AudioMoth and Raspberry Pi 3B+ for real-time classification.
Result: Achieves up to 90.8% accuracy on acoustically distinctive species and 70.1% on the full 70-species dataset, consuming only 77mJ per inference on AudioMoth and being over 16x more energy-efficient than Birdnet on Raspberry Pi 3B+.
Conclusion: WrenNet demonstrates the first practical framework for continuous, multi-species acoustic monitoring on low-power edge devices, offering significant improvements in energy efficiency and deployment feasibility for biodiversity monitoring.
Abstract: This paper introduces WrenNet, an efficient neural network enabling real-time multi-species bird audio classification on low-power microcontrollers for scalable biodiversity monitoring. We propose a semi-learnable spectral feature extractor that adapts to avian vocalizations, outperforming standard mel-scale and fully-learnable alternatives. On an expert-curated 70-species dataset, WrenNet achieves up to 90.8% accuracy on acoustically distinctive species and 70.1% on the full task. When deployed on an AudioMoth device (≤1 MB RAM), it consumes only 77 mJ per inference. Moreover, the proposed model is over 16x more energy-efficient than Birdnet when running on a Raspberry Pi 3B+. This work demonstrates the first practical framework for continuous, multi-species acoustic monitoring on low-power edge devices.
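A sketch of the "semi-learnable spectral feature extractor" idea, assuming Gaussian filters whose per-filter centers and widths would be trained jointly with the classifier. This parameterization is an illustrative assumption; WrenNet's actual extractor is not specified in this summary.

```python
import numpy as np

def gaussian_filterbank(n_filters, n_fft_bins, centers, widths):
    """Filterbank with per-filter center/width parameters. In a
    semi-learnable front end these parameters would be optimized with
    the classifier instead of being fixed to the mel scale."""
    bins = np.arange(n_fft_bins)[None, :]                  # (1, F)
    c = centers[:, None]                                   # (M, 1)
    w = widths[:, None]
    fb = np.exp(-0.5 * ((bins - c) / w) ** 2)              # (M, F)
    return fb / (fb.sum(axis=1, keepdims=True) + 1e-9)     # row-normalize

n_filters, n_bins = 32, 257
centers = np.linspace(0, n_bins - 1, n_filters)            # init: linear spacing
widths = np.full(n_filters, n_bins / n_filters)
fb = gaussian_filterbank(n_filters, n_bins, centers, widths)

# Apply to a dummy power spectrogram: (frames, fft_bins) -> (frames, filters).
power_spec = np.abs(np.random.default_rng(0).normal(size=(100, n_bins))) ** 2
features = np.log(power_spec @ fb.T + 1e-6)
print(features.shape)
```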
[334] Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization
Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Qihao Liang, Torin Hopkins, Ye Wang
Main category: cs.SD
TL;DR: A unified framework for automatic multitrack music arrangement using a single pre-trained model that handles diverse scenarios like reinterpretation, simplification, and additive generation through segment-level reconstruction with disentangled content and style tokens.
Details
Motivation: To create a general-purpose symbolic music arrangement system that can handle multiple arrangement scenarios without requiring task-specific models, enabling flexible any-to-any instrumentation transformations.
Method: Uses a segment-level reconstruction objective with token-level disentangled content and style representations, along with REMI-z structured tokenization for multitrack symbolic music to enhance modeling efficiency.
Result: Outperforms task-specific state-of-the-art models on band arrangement, piano reduction, and drum arrangement tasks in both objective metrics and perceptual evaluations.
Conclusion: The framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation, providing a unified solution for diverse music arrangement needs.
Abstract: We present a unified framework for automatic multitrack music arrangement that enables a single pre-trained symbolic music model to handle diverse arrangement scenarios, including reinterpretation, simplification, and additive generation. At its core is a segment-level reconstruction objective operating on token-level disentangled content and style, allowing for flexible any-to-any instrumentation transformations at inference time. To support track-wise modeling, we introduce REMI-z, a structured tokenization scheme for multitrack symbolic music that enhances modeling efficiency and effectiveness for both arrangement tasks and unconditional generation. Our method outperforms task-specific state-of-the-art models on representative tasks in different arrangement scenarios (band arrangement, piano reduction, and drum arrangement) in both objective metrics and perceptual evaluations. Taken together, our framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation.
[335] Stylus: Repurposing Stable Diffusion for Training-Free Music Style Transfer on Mel-Spectrograms
Heehwan Wang, Joonwoo Kwon, Sooyoung Kim, Jungwoo Seo, Shinjae Yoo, Yuewei Lin, Jiook Cha
Main category: cs.SD
TL;DR: Stylus is a training-free framework that repurposes pre-trained Stable Diffusion for music style transfer by manipulating self-attention to inject style features while preserving musical structure, achieving superior results without additional training.
Details
Motivation: Existing music style transfer approaches require paired datasets, extensive training, or detailed annotations, which limits their practicality and accessibility.
Method: Stylus manipulates self-attention in pre-trained Stable Diffusion by injecting style key-value features while preserving source queries. It uses phase-preserving reconstruction to avoid artifacts and classifier-free-guidance-inspired control for adjustable stylization.
Result: Stylus outperforms state-of-the-art baselines with 34.1% higher content preservation and 25.7% better perceptual quality without any additional training.
Conclusion: The framework demonstrates that effective music style transfer can be achieved through training-free adaptation of existing diffusion models, making the technology more accessible and practical.
Abstract: Music style transfer enables personalized music creation by blending the structure of a source with the stylistic attributes of a reference. Existing text-conditioned and diffusion-based approaches show promise but often require paired datasets, extensive training, or detailed annotations. We present Stylus, a training-free framework that repurposes a pre-trained Stable Diffusion model for music style transfer in the mel-spectrogram domain. Stylus manipulates self-attention by injecting style key-value features while preserving source queries to maintain musical structure. To improve fidelity, we introduce a phase-preserving reconstruction strategy that avoids artifacts from Griffin-Lim reconstruction, and we adopt classifier-free-guidance-inspired control for adjustable stylization and multi-style blending. In extensive evaluations, Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality without any additional training.
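The key-value injection can be sketched in a few lines. This is an illustrative single-head attention in NumPy, not the Stable Diffusion implementation; the `gamma` bias is an assumed stand-in for the paper's guidance-inspired strength control.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def style_injected_attention(q_src, k_sty, v_sty, k_src, v_src, gamma=0.7):
    """Keep the source queries (musical structure) but attend to both
    source and injected style key/value features; gamma in [0, 1]
    trades structure preservation against stylization strength."""
    k = np.concatenate([k_src, k_sty], axis=0)
    v = np.concatenate([v_src, v_sty], axis=0)
    d = q_src.shape[-1]
    logits = q_src @ k.T / np.sqrt(d)
    # Bias attention mass toward style tokens by gamma.
    bias = np.concatenate([np.full(len(k_src), np.log(1 - gamma + 1e-9)),
                           np.full(len(k_sty), np.log(gamma + 1e-9))])
    return softmax(logits + bias) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8)); ks = rng.normal(size=(6, 8)); vs = rng.normal(size=(6, 8))
kt = rng.normal(size=(4, 8)); vt = rng.normal(size=(4, 8))
print(style_injected_attention(q, kt, vt, ks, vs).shape)  # (6, 8)
```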
cs.LG
[336] Wavelet Fourier Diffuser: Frequency-Aware Diffusion Model for Reinforcement Learning
Yifu Luo, Yongzhe Chang, Xueqian Wang
Main category: cs.LG
TL;DR: WFDiffuser is a novel diffusion-based RL framework that addresses frequency shift issues in trajectory modeling by integrating wavelet and Fourier transforms to handle both low- and high-frequency components.
Details
Motivation: Existing diffusion-based RL approaches focus only on time-domain features, causing frequency shift that leads to trajectory instability and degraded performance. The paper aims to solve this by analyzing RL from a frequency-domain perspective.
Method: Proposes WFDiffuser, which uses Discrete Wavelet Transform to decompose trajectories into frequency components, then employs Short-Time Fourier Transform and cross attention mechanisms for frequency feature extraction and cross-frequency interaction.
Result: Extensive experiments on D4RL benchmark show WFDiffuser effectively mitigates frequency shift, producing smoother trajectories and improved decision-making performance over existing methods.
Conclusion: The frequency-domain perspective and WFDiffuser framework successfully address frequency shift issues in diffusion-based RL, demonstrating superior performance through better trajectory modeling.
Abstract: Diffusion probability models have shown significant promise in offline reinforcement learning by directly modeling trajectory sequences. However, existing approaches primarily focus on time-domain features while overlooking frequency-domain features, leading to frequency shift and degraded performance according to our observation. In this paper, we investigate the RL problem from a new perspective of the frequency domain. We first observe that time-domain-only approaches inadvertently introduce shifts in the low-frequency components of the frequency domain, which results in trajectory instability and degraded performance. To address this issue, we propose Wavelet Fourier Diffuser (WFDiffuser), a novel diffusion-based RL framework that integrates Discrete Wavelet Transform to decompose trajectories into low- and high-frequency components. To further enhance diffusion modeling for each component, WFDiffuser employs Short-Time Fourier Transform and cross attention mechanisms to extract frequency-domain features and facilitate cross-frequency interaction. Extensive experiment results on the D4RL benchmark demonstrate that WFDiffuser effectively mitigates frequency shift, leading to smoother, more stable trajectories and improved decision-making performance over existing methods.
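The wavelet split at the heart of WFDiffuser is easy to demonstrate with PyWavelets. A minimal sketch, assuming a single state dimension and one decomposition level; the paper's wavelet choice and depth may differ.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 256)
trajectory = np.sin(t) + 0.2 * rng.normal(size=t.size)  # one state dimension

# Single-level DWT: approximation = low-frequency trend,
# detail = high-frequency component.
approx, detail = pywt.dwt(trajectory, "db4")

# Reconstruct each branch separately to inspect the two components
# that separate diffusion modeling would operate on.
low_freq = pywt.idwt(approx, None, "db4")[: t.size]
high_freq = pywt.idwt(None, detail, "db4")[: t.size]
print(low_freq.shape, high_freq.shape)
```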
[337] Anti-Money Laundering Systems Using Deep Learning
Mashkhal Abdalwahid Sidiq, Yimamu Kirubel Wondaferew
Main category: cs.LG
TL;DR: This paper proposes using deep learning with centrality algorithms for money laundering detection, showing superiority over traditional rule-based AML systems by analyzing transaction networks contextually.
Details
Motivation: Traditional AML systems have high false positive rates and lack sophistication to detect complex money laundering schemes, creating a need for more advanced detection methods.
Method: The paper implements an advanced AML system using deep learning techniques with centrality algorithms (Degree Centrality, Closeness Centrality, Betweenness Centrality, PageRank) for link analysis in financial transaction networks.
Result: The GCN (Graph Convolutional Network) model demonstrated practicality and superiority for graph-structured data, analyzing transactions in their financial context rather than in isolation.
Conclusion: Integrating deep learning with centrality algorithms shows promise for enhancing AML system effectiveness by improving detection capabilities for complex money laundering patterns.
Abstract: In this paper, we focus on using deep learning methods for detecting money laundering in financial transaction networks, in order to demonstrate that they can complement or replace the more commonly used rule-based and conventional Anti-Money Laundering (AML) systems. The paper explores the pivotal role played by AML activities in the global financial industry and underscores the drawbacks of conventional AML systems, which exhibit high rates of false positives and lack the sophistication to uncover intricate money laundering schemes. To tackle these challenges, the paper proposes an advanced AML system that capitalizes on link analysis using deep learning techniques. At the heart of this system lies the utilization of centrality algorithms such as Degree Centrality, Closeness Centrality, Betweenness Centrality, and PageRank. These algorithms enhance the system’s capability to identify suspicious activities by examining the influence and interconnections within networks of financial transactions. The results showed the practicality and superiority of the GCN model, which is well suited to graph-structured data: a transaction or account is analyzed in the context of its financial environment rather than in isolation. Finally, the paper considers the prospects of AML efforts, proposing the integration of emerging technologies such as deep learning and centrality algorithms, which holds promise for enhancing the effectiveness of AML systems by refining their capabilities.
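The four centrality features are one-liners with networkx. A toy sketch on a synthetic transaction graph (node and edge names are illustrative); vectors like these can be concatenated with transaction attributes as GCN node features.

```python
import networkx as nx

# Toy transaction graph: edges are transfers between accounts.
G = nx.DiGraph()
G.add_edges_from([
    ("acct_A", "acct_B"), ("acct_B", "acct_C"), ("acct_C", "acct_A"),  # cycle
    ("acct_D", "acct_B"), ("acct_E", "acct_B"),                        # fan-in
])

features = {
    "degree": nx.degree_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "pagerank": nx.pagerank(G),
}
# Each account gets a small feature vector describing its structural role,
# so it is scored in the context of its financial environment.
for node in G.nodes:
    print(node, [round(features[f][node], 3) for f in features])
```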
[338] DeepACTIF: Efficient Feature Attribution via Activation Traces in Neural Sequence Models
Benedikt W. Hosp
Main category: cs.LG
TL;DR: DeepACTIF is a lightweight, architecture-aware feature attribution method for time-series models that uses internal LSTM activations to efficiently estimate feature importance, outperforming traditional methods like SHAP and IG in speed and accuracy.
Details
Motivation: Standard attribution methods are computationally intensive and unsuitable for real-time applications in time-series domains like healthcare and biometrics, creating a need for efficient interpretability solutions.
Method: Uses inverse-weighted aggregation of LSTM activations across time steps, focusing on stability and magnitude of activations to estimate feature importance efficiently.
Result: Outperforms SHAP, IG, and DeepLIFT in accuracy and robustness, preserves predictive performance with top 10% features, reduces computation time/memory usage by orders of magnitude.
Conclusion: DeepACTIF enables real-time interpretability on edge devices, making it suitable for mobile XR headsets and embedded health monitors.
Abstract: Feature attribution is essential for interpreting deep learning models, particularly in time-series domains such as healthcare, biometrics, and human-AI interaction. However, standard attribution methods, such as Integrated Gradients or SHAP, are computationally intensive and not well-suited for real-time applications. We present DeepACTIF, a lightweight and architecture-aware feature attribution method that leverages internal activations of sequence models to estimate feature importance efficiently. Focusing on LSTM-based networks, we introduce an inverse-weighted aggregation scheme that emphasises stability and magnitude of activations across time steps. Our evaluation across three biometric gaze datasets shows that DeepACTIF not only preserves predictive performance under severe feature reduction (top 10% of features) but also significantly outperforms established methods, including SHAP, IG, and DeepLIFT, in terms of both accuracy and statistical robustness. Using Wilcoxon signed-rank tests and effect size analysis, we demonstrate that DeepACTIF yields more informative feature rankings with significantly lower error across all top-k conditions (10 - 40%). Our experiments demonstrate that DeepACTIF not only reduces computation time and memory usage by orders of magnitude but also preserves model accuracy when using only top-ranked features. That makes DeepACTIF a viable solution for real-time interpretability on edge devices such as mobile XR headsets or embedded health monitors.
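A sketch of one plausible reading of the aggregation: score each hidden unit by mean activation magnitude, damped by its temporal instability. This interpretation of "inverse-weighted aggregation emphasising stability and magnitude" is an assumption, not DeepACTIF's published formula.

```python
import numpy as np

def deepactif_style_importance(activations):
    """activations: (time_steps, hidden_units) from an LSTM layer.
    Units with large, temporally stable activations score highest;
    an assumed reading of the method's weighting scheme."""
    magnitude = np.abs(activations).mean(axis=0)
    instability = activations.std(axis=0)
    return magnitude / (1.0 + instability)

rng = np.random.default_rng(0)
acts = rng.normal(size=(120, 64))          # e.g. 120 gaze frames, 64 units
scores = deepactif_style_importance(acts)
top_k = np.argsort(scores)[::-1][: int(0.1 * scores.size)]  # keep top 10%
print(top_k)
```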
[339] Analyzing the Impact of Credit Card Fraud on Economic Fluctuations of American Households Using an Adaptive Neuro-Fuzzy Inference System
Zhuqi Wang, Qinghe Zhang, Zhuopei Cheng
Main category: cs.LG
TL;DR: A hybrid credit card fraud detection method using Enhanced ANFIS with wavelet decomposition and temporal attention mechanisms to improve detection accuracy.
Details
Motivation: Credit card fraud is a growing threat to household finances, causing unpredictable economic behavior changes that need better detection methods.
Method: Enhanced ANFIS model with multi-resolution wavelet decomposition on transaction data and macroeconomic indicators, deep fuzzy rule library with adaptive Gaussian membership functions, and temporal attention mechanism for weighting multi-scale economic patterns.
Result: Experimental results show 17.8% reduction in RMSE compared to local neuro-fuzzy models and conventional LSTM models.
Conclusion: The proposed hybrid method significantly improves fraud detection accuracy by integrating wavelet analysis, fuzzy inference, and temporal attention mechanisms.
Abstract: Credit card fraud is assuming growing proportions as a major threat to the financial position of American households, leading to unpredictable changes in household economic behavior. To address this problem, this paper presents a new hybrid analysis method based on an Enhanced ANFIS. The model introduces several advances over the conventional ANFIS framework, employing a multi-resolution wavelet decomposition module and a temporal attention mechanism. The model performs discrete wavelet transformations on historical transaction data and macroeconomic indicators to generate localized economic shock signals. The transformed features are then fed into a deep fuzzy rule library based on Takagi-Sugeno fuzzy rules with adaptive Gaussian membership functions. A temporal attention encoder adaptively assigns weights to multi-scale economic behavior patterns, increasing the effectiveness of relevance assessment in the fuzzy inference stage and enhancing the capture of long-term temporal dependencies and anomalies caused by fraudulent activities. The proposed method differs from classical ANFIS, which has fixed input-output relations, in that it integrates fuzzy rule activation with wavelet basis selection and temporal correlation weights via a modular training procedure. Experimental results show that the RMSE was reduced by 17.8% compared with local neuro-fuzzy models and conventional LSTM models.
[340] Unsupervised Outlier Detection in Audit Analytics: A Case Study Using USA Spending Data
Buhe Li, Berkay Kaplan, Maksym Lazirko, Aleksandr Kogan
Main category: cs.LG
TL;DR: This study compares unsupervised outlier detection methods for audit analytics using USA spending data, finding that hybrid approaches combining multiple algorithms improve anomaly detection accuracy in governmental financial data.
Details
Motivation: Addressing the need for efficient anomaly detection in large-scale governmental datasets where traditional auditing methods may be insufficient, particularly for federal spending oversight.
Method: Employed and compared multiple outlier detection algorithms (HBOS, Robust PCA, MCD, KNN) on DHHS spending data, with performance evaluation using precision, recall, and F1 scores.
Result: Results show that hybrid approaches combining multiple detection strategies enhance robustness and accuracy of outlier identification in complex financial data.
Conclusion: Unsupervised learning techniques can significantly improve audit quality and efficiency, with implications for auditors, policymakers, and researchers in governmental financial oversight.
Abstract: This study investigates the effectiveness of unsupervised outlier detection methods in audit analytics, utilizing USA spending data from the U.S. Department of Health and Human Services (DHHS) as a case example. We employ and compare multiple outlier detection algorithms, including Histogram-based Outlier Score (HBOS), Robust Principal Component Analysis (PCA), Minimum Covariance Determinant (MCD), and K-Nearest Neighbors (KNN) to identify anomalies in federal spending patterns. The research addresses the growing need for efficient and accurate anomaly detection in large-scale governmental datasets, where traditional auditing methods may fall short. Our methodology involves data preparation, algorithm implementation, and performance evaluation using precision, recall, and F1 scores. Results indicate that a hybrid approach, combining multiple detection strategies, enhances the robustness and accuracy of outlier identification in complex financial data. This study contributes to the field of audit analytics by providing insights into the comparative effectiveness of various outlier detection models and demonstrating the potential of unsupervised learning techniques in improving audit quality and efficiency. The findings have implications for auditors, policymakers, and researchers seeking to leverage advanced analytics in governmental financial oversight and risk management.
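A minimal hybrid-scoring sketch using scikit-learn stand-ins: EllipticEnvelope for MCD, k-NN distance, and PCA reconstruction error, combined by rank averaging since the raw scores live on different scales. HBOS (available in packages such as pyod) would slot into the same combination; the exact hybrid rule in the study may differ.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 5)),
               rng.normal(6, 1, size=(10, 5))])   # 10 planted outliers

# MCD: negated decision function so that higher = more anomalous.
mcd = EllipticEnvelope(random_state=0).fit(X)
s_mcd = -mcd.decision_function(X)

# KNN: distance to the k-th nearest neighbor.
nn = NearestNeighbors(n_neighbors=6).fit(X)
dists, _ = nn.kneighbors(X)
s_knn = dists[:, -1]

# PCA: reconstruction error from a low-rank projection.
pca = PCA(n_components=2).fit(X)
s_pca = ((X - pca.inverse_transform(pca.transform(X))) ** 2).sum(axis=1)

# Hybrid: average each detector's rank, one simple way to fuse
# detectors with incommensurable score scales.
hybrid = np.mean([rankdata(s) for s in (s_mcd, s_knn, s_pca)], axis=0)
print("top-10 suspected outliers:", np.argsort(hybrid)[::-1][:10])
```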
[341] Representation-based Broad Hallucination Detectors Fail to Generalize Out of Distribution
Zuzanna Dubanowska, Maciej Żelaszczyk, Michał Brzozowski, Paolo Mandica, Michał Karpowicz
Main category: cs.LG
TL;DR: Current SOTA hallucination detection methods show poor performance when controlling for spurious correlations, performing no better than simple supervised linear probes and failing at out-of-distribution generalization.
Details
Motivation: To critically evaluate the effectiveness of state-of-the-art hallucination detection methods and identify limitations in current approaches.
Method: Analyzed SOTA methods on the RAGTruth dataset, controlled for spurious correlations, compared with supervised linear probes, and tested out-of-distribution generalization.
Result: SOTA performance is largely driven by spurious data correlations; when controlled, it performs no better than simple linear probes. All methods perform close to random on out-of-distribution data.
Conclusion: Current hallucination detection methods have significant limitations, and the paper proposes guidelines for improved evaluation and detection approaches.
Abstract: We critically assess the efficacy of the current SOTA in hallucination detection and find that its performance on the RAGTruth dataset is largely driven by a spurious correlation with data. Controlling for this effect, state-of-the-art performs no better than supervised linear probes, while requiring extensive hyperparameter tuning across datasets. Out-of-distribution generalization is currently out of reach, with all of the analyzed methods performing close to random. We propose a set of guidelines for hallucination detection and its evaluation.
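The "simple supervised linear probe" baseline is worth seeing concretely. A sketch with synthetic stand-in features; in the paper's setting the feature matrix would hold the LLM's hidden states for each answer on RAGTruth, with labels marking hallucinations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in features: one hidden-state vector per generated answer.
rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 768))      # hypothetical residual-stream states
w_true = rng.normal(size=768)
y = (H @ w_true + rng.normal(scale=5, size=2000) > 0).astype(int)  # 1 = hallucinated

H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000, C=1.0).fit(H_tr, y_tr)
print("in-distribution AUROC:",
      roc_auc_score(y_te, probe.predict_proba(H_te)[:, 1]))
# The paper's point: once spurious correlations are controlled, far more
# elaborate detectors fail to beat a probe this simple, and neither
# transfers out of distribution.
```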
[342] Uncertainty Quantification of Large Language Models using Approximate Bayesian Computation
Mridul Sharma, Adeetya Patel, Zaneta D’Souza, Samira Abbasgholizadeh Rahimi, Siva Reddy, Sreenath Madathil
Main category: cs.LG
TL;DR: Proposes Approximate Bayesian Computation (ABC) to improve LLM uncertainty calibration in clinical diagnostics, achieving significant accuracy and calibration improvements over standard baselines.
Details
Motivation: LLMs struggle with uncertainty expression, which is critical for reliable deployment in high-stakes domains like clinical diagnostics. Existing methods produce overconfident and poorly calibrated estimates.
Method: Uses Approximate Bayesian Computation (ABC), a likelihood-free Bayesian inference approach that treats LLMs as stochastic simulators to infer posterior distributions over predictive probabilities.
Result: ABC improves accuracy by up to 46.9%, reduces Brier scores by 74.4%, and enhances calibration as measured by Expected Calibration Error (ECE) and predictive entropy on clinical benchmarks.
Conclusion: The ABC approach significantly outperforms standard baseline methods for LLM uncertainty calibration in clinical diagnostic applications, providing more reliable uncertainty estimates.
Abstract: Despite their widespread applications, Large Language Models (LLMs) often struggle to express uncertainty, posing a challenge for reliable deployment in high-stakes and safety-critical domains like clinical diagnostics. Existing standard baseline methods, such as model logits and elicited probabilities, produce overconfident and poorly calibrated estimates. In this work, we propose Approximate Bayesian Computation (ABC), a likelihood-free Bayesian inference approach that treats LLMs as stochastic simulators to infer posterior distributions over predictive probabilities. We evaluate our ABC approach on two clinically relevant benchmarks: a synthetic oral lesion diagnosis dataset and the publicly available GretelAI symptom-to-diagnosis dataset. Compared to standard baselines, our approach improves accuracy by up to 46.9%, reduces Brier scores by 74.4%, and enhances calibration as measured by Expected Calibration Error (ECE) and predictive entropy.
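Rejection-sampling ABC is compact enough to sketch. Here the "simulator" is a stub binomial draw standing in for repeated stochastic LLM queries; the flat prior, tolerance, and summary statistic are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def llm_simulate(theta, n_queries, rng):
    """Stub simulator: theta is the latent probability that the model
    answers a diagnostic question correctly. A real setup would resample
    LLM outputs (e.g. at nonzero temperature) instead."""
    return rng.binomial(n_queries, theta)

def abc_posterior(observed_correct, n_queries, n_sims=20000, tol=1):
    """Likelihood-free rejection ABC: keep prior draws whose simulated
    outcome lands within `tol` of the observed outcome."""
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 1, size=n_sims)              # flat prior
    sims = np.array([llm_simulate(t, n_queries, rng) for t in theta])
    return theta[np.abs(sims - observed_correct) <= tol]

post = abc_posterior(observed_correct=17, n_queries=20)
print("posterior mean %.3f, 95%% interval (%.3f, %.3f)"
      % (post.mean(), np.quantile(post, 0.025), np.quantile(post, 0.975)))
```

The payoff over a raw point estimate is the spread of the accepted draws: it quantifies how uncertain the predictive probability itself is.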
[343] Solving Freshness in RAG: A Simple Recency Prior and the Limits of Heuristic Trend Detection
Matthew Grofsky
Main category: cs.LG
TL;DR: Two methods for addressing temporal failures in RAG systems on cybersecurity data: recency prior achieved perfect accuracy on freshness tasks, while clustering heuristic for topic evolution failed with low F1-score.
Details
Motivation: To address temporal failures in RAG (Retrieval-Augmented Generation) systems, particularly in handling time-sensitive cybersecurity data where freshness and topic evolution are critical.
Method: Tested two approaches: 1) A simple recency prior method for freshness detection, 2) A clustering heuristic for detecting topic evolution over time.
Result: Recency prior method achieved 1.00 accuracy on freshness tasks, while the clustering heuristic for topic evolution performed poorly with only 0.08 F1-score.
Conclusion: Simple heuristics like recency prior work well for freshness detection, but trend detection and topic evolution require more sophisticated methods beyond basic clustering approaches.
Abstract: We address temporal failures in RAG systems using two methods on cybersecurity data. A simple recency prior achieved an accuracy of 1.00 on freshness tasks. In contrast, a clustering heuristic for topic evolution failed (0.08 F1-score), showing trend detection requires methods beyond simple heuristics.
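A recency prior of this kind can be a one-line reweighting of retrieval scores. A minimal sketch with an assumed exponential decay and an illustrative half-life; the paper's exact prior may differ.

```python
import datetime as dt
import math

def recency_weighted_score(sim, doc_date, today, half_life_days=90.0):
    """Blend semantic similarity with an exponential-decay recency prior.
    half_life_days is an illustrative choice, not the paper's value."""
    age_days = (today - doc_date).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return sim * recency

today = dt.date(2025, 1, 1)
docs = [("CVE advisory, fresh", 0.80, dt.date(2024, 12, 20)),
        ("CVE advisory, stale", 0.85, dt.date(2023, 1, 5))]
for name, sim, date in docs:
    print(name, round(recency_weighted_score(sim, date, today), 3))
# The stale document's higher raw similarity no longer wins the ranking.
```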
[344] Learning from Observation: A Survey of Recent Advances
Returaj Burnwal, Hriday Mehta, Nirav Pravinbhai Bhatt, Balaraman Ravindran
Main category: cs.LG
TL;DR: A survey paper that presents a framework for Learning from Observation (LfO) or state-only imitation learning, classifies existing methods, and connects LfO with related fields like offline RL, model-based RL, and hierarchical RL.
Details
Motivation: Traditional imitation learning requires both state and action information from expert demonstrations, which is impractical in real-world applications where expert actions are difficult to obtain. LfO addresses this limitation by using only expert state visitation information.
Method: The paper presents a framework for LfO and uses it to survey and classify existing LfO methods based on trajectory construction, assumptions, and algorithm design choices. It also draws connections to related fields.
Result: The survey provides a comprehensive classification of LfO methods and identifies relationships between LfO and other reinforcement learning approaches.
Conclusion: The framework helps identify open problems in LfO and suggests future research directions for state-only imitation learning.
Abstract: Imitation Learning (IL) algorithms offer an efficient way to train an agent by mimicking an expert’s behavior without requiring a reward function. IL algorithms often necessitate access to state and action information from expert demonstrations. Although expert actions can provide detailed guidance, requiring such action information may prove impractical for real-world applications where expert actions are difficult to obtain. To address this limitation, the concept of learning from observation (LfO) or state-only imitation learning (SOIL) has recently gained attention, wherein the imitator only has access to expert state visitation information. In this paper, we present a framework for LfO and use it to survey and classify existing LfO methods in terms of their trajectory construction, assumptions, and algorithm design choices. This survey also draws connections to several related fields, such as offline RL, model-based RL, and hierarchical RL. Finally, we use our framework to identify open problems and suggest future research directions.
[345] TensLoRA: Tensor Alternatives for Low-Rank Adaptation
Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, François Leduc-Primeau
Main category: cs.LG
TL;DR: TensLoRA is a unified framework that aggregates LoRA updates into higher-order tensors, enabling more efficient and flexible low-rank adaptation of Transformers with mode-specific compression rates.
Details
Motivation: Current LoRA methods treat attention projection matrices independently for each layer and projection type, lacking a systematic framework for joint tensor-based adaptations that could improve efficiency and performance.
Method: TensLoRA aggregates LoRA updates into higher-order tensors, modeling a broad family of tensor-based low-rank adaptations that generalize existing methods and allow mode-specific compression rates tailored to modality and task.
Result: Experiments on vision and language benchmarks show that tensor construction directly impacts performance, sometimes outperforming standard LoRA under similar parameter counts.
Conclusion: TensLoRA provides a unified framework for tensor-based low-rank adaptation that enables more flexible parameter allocation and can achieve better performance than standard LoRA approaches.
Abstract: Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are considered independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes better than standard LoRA under similar parameter counts.
[346] OmniFed: A Modular Framework for Configurable Federated Learning from Edge to HPC
Sahil Tyagi, Andrei Cozma, Olivera Kotevska, Feiyi Wang
Main category: cs.LG
TL;DR: OmniFed is a modular federated learning framework that provides configuration-driven prototyping, support for different topologies and communication protocols, and pluggable privacy/compression mechanisms.
Details
Motivation: Federated Learning is crucial for edge and HPC environments where data cannot be centralized and privacy is essential, but existing solutions lack flexibility and comprehensive feature integration.
Method: Developed a modular framework with decoupled architecture supporting configuration-driven prototyping, multiple topologies, mixed communication protocols, and pluggable modules for privacy (DP, HE, SA) and compression.
Result: OmniFed successfully unifies topology configuration, mixed-protocol communication, and pluggable modules in one stack, enabling streamlined FL deployment across heterogeneous environments.
Conclusion: The framework provides a comprehensive solution for federated learning that balances flexibility, privacy, and performance through its modular, configuration-driven approach.
Abstract: Federated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL deployment across heterogeneous environments. The GitHub repository is available at https://github.com/at-aaims/OmniFed.
[347] TimeMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding
Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan
Main category: cs.LG
TL;DR: TimeMosaic is a multivariate time series forecasting framework that addresses temporal heterogeneity through adaptive patch embedding and segment-wise decoding, outperforming existing methods.
Details
Motivation: Existing patch-based methods use fixed-length segmentation, which overlooks heterogeneity in local temporal dynamics and decoding requirements, leading to loss of details in information-dense regions and redundancy in stable segments.
Method: TimeMosaic employs adaptive patch embedding to dynamically adjust granularity based on local information density, and introduces segment-wise decoding that treats each prediction horizon as a related subtask with horizon-specific adaptation.
Result: Extensive evaluations show TimeMosaic delivers consistent improvements over existing methods, and when trained on a large-scale corpus with 321 billion observations, achieves performance competitive with state-of-the-art TSFMs.
Conclusion: The proposed framework effectively addresses temporal heterogeneity in multivariate time series forecasting through adaptive granularity and horizon-specific decoding strategies.
Abstract: Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.
[348] Enhancing Credit Default Prediction Using Boruta Feature Selection and DBSCAN Algorithm with Different Resampling Techniques
Obu-Amoah Ampomah, Edmund Agyemang, Kofi Acheampong, Louis Agyekum
Main category: cs.LG
TL;DR: This study compares SMOTE, SMOTE-Tomek, and ADASYN techniques for handling class imbalance in credit default prediction, finding that Boruta+DBSCAN+SMOTE-Tomek+GBM classifier achieved the best performance.
Details
Motivation: Credit default datasets are typically imbalanced with few defaulters, making accurate prediction challenging. The research aims to address this class imbalance problem to improve credit default prediction systems.
Method: Used real-world credit default data from Cleveland ML Repository. Applied feature selection (Boruta) and outlier detection (DBSCAN) before/after resampling. Tested traditional classifiers (Naive Bayes, KNN) and ensemble boosting algorithms (XGBoost, AdaBoost, GBM, Light GBM) with three resampling techniques (SMOTE, SMOTE-Tomek, ADASYN).
Result: Boruta+DBSCAN+SMOTE-Tomek+GBM classifier performed best with F1-score: 82.56%, G-mean: 82.98%, ROC-AUC: 90.90%, PR-AUC: 91.85%.
Conclusion: The findings provide a foundation for developing more resilient credit default systems, which is essential as credit-based transactions continue to increase globally.
Abstract: This study examines credit default prediction by comparing three techniques, namely SMOTE, SMOTE-Tomek, and ADASYN, that are commonly used to address the class imbalance problem in credit default situations. Recognizing that credit default datasets are typically skewed, with defaulters comprising a much smaller proportion than non-defaulters, we began our analysis by evaluating machine learning (ML) models on the imbalanced data without any resampling to establish baseline performance. These baseline results provide a reference point for understanding the impact of subsequent balancing methods. In addition to traditional classifiers such as Naive Bayes and K-Nearest Neighbors (KNN), our study also explores the suitability of advanced ensemble boosting algorithms, including Extreme Gradient Boosting (XGBoost), AdaBoost, Gradient Boosting Machines (GBM), and Light GBM for credit default prediction using Boruta feature selection and DBSCAN-based outlier detection, both before and after resampling. A real-world credit default data set sourced from the University of Cleveland ML Repository was used to build ML classifiers, and their performances were tested. The criteria chosen to measure model performance are the area under the receiver operating characteristic curve (ROC-AUC), area under the precision-recall curve (PR-AUC), G-mean, and F1-scores. The results from this empirical study indicate that the Boruta+DBSCAN+SMOTE-Tomek+GBM classifier outperformed the other ML models (F1-score: 82.56%, G-mean: 82.98%, ROC-AUC: 90.90%, PR-AUC: 91.85%) in a credit default context. The findings establish a foundation for future progress in creating more resilient and adaptive credit default systems, which will be essential as credit-based transactions continue to rise worldwide.
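The winning pipeline's resampling stage is straightforward to reproduce in outline with imbalanced-learn. A sketch on synthetic data; the Boruta step is omitted, and the DBSCAN parameters are purely illustrative (they would be tuned per dataset).

```python
from imblearn.combine import SMOTETomek
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the credit default set (~7% defaulters).
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# DBSCAN-based outlier removal on the training portion only
# (Boruta feature selection would precede this step).
noise = DBSCAN(eps=2.0, min_samples=10).fit_predict(X_tr) == -1
X_tr, y_tr = X_tr[~noise], y_tr[~noise]

# SMOTE-Tomek: oversample the minority class, then clean Tomek links.
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
print("held-out F1:", round(f1_score(y_te, clf.predict(X_te)), 3))
```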
[349] Analyzing Uncertainty Quantification in Statistical and Deep Learning Models for Probabilistic Electricity Price Forecasting
Andreas Lebedev, Abhinav Das, Sven Pappert, Stephan Schlüter
Main category: cs.LG
TL;DR: This paper examines uncertainty quantification in probabilistic forecasting models for electricity prices, comparing deep learning and statistical approaches with various uncertainty methods like ensembles, MC dropout, and conformal prediction.
Details
Motivation: Most probabilistic forecasting models don’t fully capture uncertainty from both data and model choices, which is crucial for energy risk management. The study aims to address this gap in electricity price forecasting.
Method: The study compares deep distributional neural networks (DDNNs) with ensemble, MC dropout, and conformal prediction methods against LASSO-estimated autoregressive (LEAR) models with quantile regression averaging, GARCH, and conformal prediction.
Result: LEAR-based models performed well for probabilistic forecasting regardless of uncertainty method. DDNNs improved with both data and model uncertainty. Conformal prediction best captured uncertainty, and all models were competitive but performance depended on chosen metrics.
Conclusion: All models performed competitively, with LEAR models showing strong probabilistic forecasting capabilities and DDNNs benefiting from comprehensive uncertainty quantification. Model performance is metric-dependent, highlighting the importance of uncertainty quantification methods like conformal prediction.
Abstract: Precise probabilistic forecasts are fundamental for energy risk management, and there is a wide range of both statistical and machine learning models for this purpose. Inherent to these probabilistic models is some form of uncertainty quantification. However, most models do not capture the full extent of uncertainty, which arises not only from the data itself but also from model and distributional choices. In this study, we examine uncertainty quantification in state-of-the-art statistical and deep learning probabilistic forecasting models for electricity price forecasting in the German market. In particular, we consider deep distributional neural networks (DDNNs) and augment them with an ensemble approach, Monte Carlo (MC) dropout, and conformal prediction to account for model uncertainty. Additionally, we consider the LASSO-estimated autoregressive (LEAR) approach combined with quantile regression averaging (QRA), generalized autoregressive conditional heteroskedasticity (GARCH), and conformal prediction. Across a range of performance metrics, we find that the LEAR-based models perform well in terms of probabilistic forecasting, irrespective of the uncertainty quantification method. Furthermore, we find that DDNNs benefit from incorporating both data and model uncertainty, improving both point and probabilistic forecasting. Uncertainty itself appears to be best captured by the models using conformal prediction. Overall, our extensive study shows that all models under consideration perform competitively. However, their relative performance depends on the choice of metrics for point and probabilistic forecasting.
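Split conformal prediction, which the study finds best at capturing uncertainty, reduces to a quantile of calibration residuals. A minimal sketch with symmetric intervals; it assumes exchangeability, which hourly price data only approximates.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    """Split conformal: the (1 - alpha) quantile of absolute calibration
    residuals gives an interval with finite-sample marginal coverage."""
    n = len(residuals_cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(np.abs(residuals_cal), min(q_level, 1.0))
    return y_pred_new - q, y_pred_new + q

rng = np.random.default_rng(0)
resid_cal = rng.normal(scale=5.0, size=500)      # calibration errors (EUR/MWh)
lo, hi = split_conformal_interval(resid_cal, y_pred_new=82.0, alpha=0.1)
print("90%% interval: (%.1f, %.1f)" % (lo, hi))
```

Because the wrapper only needs held-out residuals, it can be laid over either the LEAR point forecasts or the DDNN outputs, which is what makes it easy to compare across both model families.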
[350] Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems
Birk Torpmann-Hagen, Pål Halvorsen, Michael A. Riegler, Dag Johansen
Main category: cs.LG
TL;DR: A novel methodology for verifying, evaluating, and risk-assessing deep learning systems by modeling distributional shifts through probability estimation and binary tree structures to provide more accurate performance estimates.
Details
Motivation: Deep neural networks often underperform in real-world deployment due to sensitivity to distributional shifts that are common in practical scenarios but rarely accounted for during evaluation, leading to inflated performance metrics.
Method: Explicitly models distributional shifts at runtime by estimating their probability from out-of-distribution detectors, combines these estimates with conditional probabilities of network correctness in a binary tree structure, and computes credible accuracy estimates through tree traversal.
Result: The approach consistently outperforms conventional evaluation across five datasets, with accuracy estimation errors typically between 0.01 and 0.1, and successfully demonstrates risk assessment capabilities on a medical segmentation benchmark.
Conclusion: Provides a robust framework for improving reliability and trustworthiness of deep learning systems in safety-critical applications by delivering more accurate performance estimates and actionable risk assessments.
Abstract: Despite achieving excellent performance on benchmarks, deep neural networks often underperform in real-world deployment due to sensitivity to minor, often imperceptible shifts in input data, known as distributional shifts. These shifts are common in practical scenarios but are rarely accounted for during evaluation, leading to inflated performance metrics. To address this gap, we propose a novel methodology for the verification, evaluation, and risk assessment of deep learning systems. Our approach explicitly models the incidence of distributional shifts at runtime by estimating their probability from outputs of out-of-distribution detectors. We combine these estimates with conditional probabilities of network correctness, structuring them in a binary tree. By traversing this tree, we can compute credible and precise estimates of network accuracy. We assess our approach on five different datasets, with which we simulate deployment conditions characterized by differing frequencies of distributional shift. Our approach consistently outperforms conventional evaluation, with accuracy estimation errors typically ranging between 0.01 and 0.1. We further showcase the potential of our approach on a medical segmentation benchmark, wherein we apply our methods towards risk assessment by associating costs with tree nodes, informing cost-benefit analyses and value-judgments. Ultimately, our approach offers a robust framework for improving the reliability and trustworthiness of deep learning systems, particularly in safety-critical applications, by providing more accurate performance estimates and actionable risk assessments.
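The tree traversal reduces to the law of total probability. A toy one-level version with illustrative numbers (not the paper's), including the cost-weighted variant used for risk assessment:

```python
# Law-of-total-probability traversal of a one-level tree:
# accuracy = P(shift) * P(correct | shift) + P(no shift) * P(correct | no shift).
# All probabilities below are illustrative, not the paper's estimates.
p_shift = 0.25                # estimated at runtime from OOD-detector outputs
acc_given_shift = 0.62        # conditional accuracy under distributional shift
acc_given_clean = 0.94        # conditional accuracy in-distribution

expected_acc = p_shift * acc_given_shift + (1 - p_shift) * acc_given_clean
print("credible accuracy estimate: %.3f" % expected_acc)

# Risk assessment: attach a cost to error leaves and compute expected cost,
# informing the cost-benefit analyses the paper describes.
cost_per_error = 100.0        # e.g. cost of a wrong segmentation
expected_cost = (p_shift * (1 - acc_given_shift) +
                 (1 - p_shift) * (1 - acc_given_clean)) * cost_per_error
print("expected cost per case: %.2f" % expected_cost)
```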
[351] A Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models
Kin G. Olivares, Malcolm Wolff, Tatiana Konstantinova, Shankar Ramasubramanian, Andrew Gordon Wilson, Andres Potapczynski, Willa Potosnak, Mengfei Cao, Boris Oreshkin, Dmitry Efimov
Main category: cs.LG
TL;DR: Current benchmarking practices for cross-frequency transfer learning (CFTL) in foundation forecasting models are flawed due to small datasets, improper statistical handling, and data leakage. The study reimplements neural networks for CFTL, uses only proprietary/synthetic data to prevent leakage, and evaluates on 15 large datasets, finding statistical models outperform FFMs by significant margins.
Details
Motivation: To address shortcomings in CFTL benchmarking practices including over-reliance on small datasets, inadequate statistical treatment, reporting of suboptimal models, and failure to account for test data leakage.
Method: Unified reimplementation of neural forecasting networks adapted for CFTL setup; pre-training only on proprietary and synthetic data while preventing test leakage; evaluation on 15 large, diverse public forecast competition datasets.
Result: Statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS and more than 20% in MASE across datasets. Synthetic dataset pre-training improves FFM accuracy by 7%.
Conclusion: Current CFTL benchmarking practices are inadequate, and while statistical models significantly outperform FFMs, synthetic data pre-training does provide measurable improvements to FFMs.
Abstract: Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on small-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failing to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely-adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models’ accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS, and by more than 20% in MASE, across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of an FFM by 7%.
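For readers unfamiliar with the headline metrics, MASE scales the forecast MAE by the in-sample MAE of a (seasonal-)naive baseline. A minimal sketch with toy numbers:

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast MAE scaled by the in-sample
    MAE of a seasonal-naive forecast (season=1 gives the naive baseline)."""
    mae = np.mean(np.abs(y_true - y_pred))
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return mae / scale

y_train = np.array([10., 12., 11., 13., 12., 14.])
y_true = np.array([13., 15.])
y_pred = np.array([12.5, 14.0])
print(mase(y_true, y_pred, y_train))  # < 1 means better than the naive baseline
```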
[352] THINNs: Thermodynamically Informed Neural Networks
Javier Castro, Benjamin Gess
Main category: cs.LG
TL;DR: Proposes THINNs - a thermodynamically consistent extension of PINNs that uses physically informed penalization based on fluctuation structure rather than heuristic penalties.
Details
Motivation: Standard PINNs use heuristic penalty terms for PDE residual minimization. This work aims to develop a physically consistent penalization approach for non-equilibrium fluctuating systems based on large deviations principles.
Method: Develops THINNs by choosing penalty terms that penalize improbable deviations according to the underlying fluctuation structure characterized by large deviations principles, rather than using heuristic penalties.
Result: Establishes analytical a posteriori estimates for THINNs and provides empirical comparisons showing improvements over established penalization strategies in PINNs.
Conclusion: THINNs offer a thermodynamically consistent formulation of PINNs with physically motivated penalization that outperforms heuristic approaches for non-equilibrium fluctuating systems.
Abstract: Physics-Informed Neural Networks (PINNs) are a class of deep learning models aiming to approximate solutions of PDEs by training neural networks to minimize the residual of the equation. Focusing on non-equilibrium fluctuating systems, we propose a physically informed choice of penalization that is consistent with the underlying fluctuation structure, as characterized by a large deviations principle. This approach yields a novel formulation of PINNs in which the penalty term is chosen to penalize improbable deviations, rather than being selected heuristically. The resulting thermodynamically consistent extension of PINNs, termed THINNs, is subsequently analyzed by establishing analytical a posteriori estimates, and providing empirical comparisons to established penalization strategies.
[353] Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
Hunjae Lee, Corey Clark
Main category: cs.LG
TL;DR: DELTAformer addresses scalability and performance issues in multivariate time series transformers by using delegate tokens to constrain inter-variable mixing while enabling full inter-temporal modeling, achieving linear scaling and state-of-the-art performance.
Details
Motivation: Variable count scalability and indiscriminate inter-variable mixing cause noise accumulation and performance degradation in multivariate time series transformers, especially with sparse informative signals and heterogeneous variables.
Method: Proposes DELTAformer with delegate tokens that constrain inter-variable modeling while allowing full inter-temporal modeling, acting as implicit regularizers for selective information propagation.
Result: DELTAformer scales linearly with variable count while outperforming standard transformers, achieving SOTA performance and superior noise-resilience in noisy MTS environments.
Conclusion: By aligning model design with MTS domain challenges, DELTAformer simultaneously achieves linear scaling and improved performance over quadratic transformers.
Abstract: Variable count is among the main scalability bottlenecks for transformer modeling in multivariate time series (MTS) data. On top of this, a growing consensus in the field points to indiscriminate inter-variable mixing as a potential source of noise-accumulation and performance degradation. This is likely exacerbated by sparsity of informative signals characteristic of many MTS systems coupled with representational misalignment stemming from indiscriminate information mixing between (heterogeneous) variables. While scalability and performance are often seen as competing interests in transformer design, we show that both can be improved simultaneously in MTS by strategically constraining the representational capacity of inter-variable mixing. Our proposed method, transformer with Delegate Token Attention (DELTAformer), constrains inter-variable modeling through what we call delegate tokens which are then used to perform full, unconstrained, inter-temporal modeling. Delegate tokens act as an implicit regularizer that forces the model to be highly selective about what inter-variable information is allowed to propagate through the network. Our results show that DELTAformer scales linearly with variable-count while actually outperforming standard transformers, achieving state-of-the-art performance across benchmarks and baselines. In addition, DELTAformer can focus on relevant signals better than standard transformers in noisy MTS environments and overall exhibit superior noise-resilience. Overall, results across various experiments confirm that by aligning our model design to leverage domain-specific challenges in MTS to our advantage, DELTAformer can simultaneously achieve linear scaling while actually improving its performance against standard, quadratic transformers.
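A rough sketch of the delegate-token idea, assuming variable embeddings of shape (batch, num_vars, dim); `DelegateTokenMixing` and its layout are hypothetical, not the paper's exact block:

```python
import torch
import torch.nn as nn

class DelegateTokenMixing(nn.Module):
    """k learned delegate tokens are the only path through which variables
    exchange information, so mixing costs O(num_vars * k) rather than
    O(num_vars^2) -- the source of linear scaling in variable count."""
    def __init__(self, dim: int, k: int = 8, heads: int = 4):
        super().__init__()
        self.delegates = nn.Parameter(torch.randn(k, dim))
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, V, D)
        d = self.delegates.expand(x.size(0), -1, -1)
        d, _ = self.gather(d, x, x)            # delegates attend to variables
        out, _ = self.scatter(x, d, d)         # variables read back from delegates
        return x + out                         # residual connection

x = torch.randn(2, 100, 64)
print(DelegateTokenMixing(64)(x).shape)        # torch.Size([2, 100, 64])
```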
[354] Constraint-Reduced MILP with Local Outlier Factor Modeling for Plausible Counterfactual Explanations in Credit Approval
Trung Nguyen Thanh, Huyen Giang Thi Thu, Tai Le Quy, Ha-Bang Ban
Main category: cs.LG
TL;DR: A refined MILP formulation for counterfactual explanation that reduces constraints in LOF objective, achieving faster solving times while maintaining quality.
Details
Motivation: Existing plausible CE methods consider data distribution but introduce many constraints, leading to high computational costs.
Method: Revisit DACE framework with refined MILP formulation that reduces LOF constraints, applied to linear SVM classifier with standard scaler.
Result: Experimental results show faster solving times while maintaining explanation quality.
Conclusion: The approach demonstrates promise for more efficient LOF modeling in counterfactual explanation and data science applications.
Abstract: Counterfactual explanation (CE) is a widely used post-hoc method that provides individuals with actionable changes to alter an unfavorable prediction from a machine learning model. Plausible CE methods improve realism by considering data distribution characteristics, but their optimization models introduce a large number of constraints, leading to high computational cost. In this work, we revisit the DACE framework and propose a refined Mixed-Integer Linear Programming (MILP) formulation that significantly reduces the number of constraints in the local outlier factor (LOF) objective component. We also apply the method to a linear SVM classifier with standard scaler. The experimental results show that our approach achieves faster solving times while maintaining explanation quality. These results demonstrate the promise of more efficient LOF modeling in counterfactual explanation and data science applications.
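Outside the MILP, the LOF plausibility criterion can be previewed with scikit-learn; this post-hoc check is only a stand-in for the paper's in-optimization LOF constraints, and the data here is synthetic:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))            # hypothetical credit features

scaler = StandardScaler().fit(X_train)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(scaler.transform(X_train))

x_cf = rng.normal(size=(1, 5))                 # candidate counterfactual
# score_samples returns the negated LOF; values near -1 are inliers,
# much smaller values flag implausible (outlying) counterfactuals.
print(lof.score_samples(scaler.transform(x_cf)))
```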
[355] Frame-based Equivariant Diffusion Models for 3D Molecular Generation
Mohan Guo, Cong Liu, Patrick Forré
Main category: cs.LG
TL;DR: The paper proposes frame-based diffusion methods that achieve deterministic E(3)-equivariance while decoupling symmetry handling from the backbone architecture, offering a scalable and flexible approach to molecular generation.
Details
Motivation: To address the trade-off between strict equivariance with costly architectures and relaxed equivariance for scalability in molecular generation methods.
Method: Three frame-based diffusion variants: Global Frame Diffusion (GFD) with shared molecular frame, Local Frame Diffusion (LFD) with node-specific frames, and Invariant Frame Diffusion (IFD) with pre-canonicalized representations. Enhanced with EdgeDiT, a Diffusion Transformer with edge-aware attention.
Result: GFD with EdgeDiT achieves state-of-the-art performance on QM9 dataset: test NLL of -137.97 (standard scale) and -141.85 (double scale), 98.98% atom stability, 90.51% molecular stability, surpassing equivariant baselines with nearly 2x faster sampling than EDM.
Conclusion: Frame-based diffusion establishes a scalable, flexible, and physically grounded paradigm for molecular generation, emphasizing the importance of global structure preservation.
Abstract: Recent methods for molecular generation face a trade-off: they either enforce strict equivariance with costly architectures or relax it to gain scalability and flexibility. We propose a frame-based diffusion paradigm that achieves deterministic E(3)-equivariance while decoupling symmetry handling from the backbone. Building on this paradigm, we investigate three variants: Global Frame Diffusion (GFD), which assigns a shared molecular frame; Local Frame Diffusion (LFD), which constructs node-specific frames and benefits from additional alignment constraints; and Invariant Frame Diffusion (IFD), which relies on pre-canonicalized invariant representations. To enhance expressivity, we further utilize EdgeDiT, a Diffusion Transformer with edge-aware attention. On the QM9 dataset, GFD with EdgeDiT achieves state-of-the-art performance, with a test NLL of -137.97 at standard scale and -141.85 at double scale, alongside atom stability of 98.98%, and molecular stability of 90.51%. These results surpass all equivariant baselines while maintaining high validity and uniqueness and nearly 2x faster sampling compared to EDM. Altogether, our study establishes frame-based diffusion as a scalable, flexible, and physically grounded paradigm for molecular generation, highlighting the critical role of global structure preservation.
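One standard way to build a shared (global) equivariant frame is from the covariance eigenvectors of the centered point cloud; a minimal sketch under that assumption (the paper's exact frame construction may differ):

```python
import numpy as np

def global_frame(coords):
    """Center the point cloud and take covariance eigenvectors as a
    rotation: rotating the input rotates the frame identically, so the
    canonicalized coordinates stay (approximately) fixed. Sign/ordering
    conventions and degenerate spectra need extra care in practice."""
    center = coords.mean(axis=0)
    centered = coords - center
    _, vecs = np.linalg.eigh(centered.T @ centered)
    if np.linalg.det(vecs) < 0:               # keep a proper rotation
        vecs[:, 0] *= -1
    return center, vecs

coords = np.random.randn(17, 3)               # toy molecule of 17 atoms
center, R = global_frame(coords)
canonical = (coords - center) @ R             # (nearly) rotation-invariant input
```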
[356] Metriplectic Conditional Flow Matching for Dissipative Dynamics
Ali Baheri, Lars Lindemann
Main category: cs.LG
TL;DR: Metriplectic conditional flow matching (MCFM) is a method that learns dissipative dynamics while preserving physical principles, ensuring stable long-horizon rollouts by incorporating conservative-dissipative structure into both vector fields and samplers.
Details
Motivation: Neural surrogates often inject energy and destabilize long-horizon rollouts. MCFM aims to address this by building first-principles structure into the learning process to ensure conservation and dissipation properties.
Method: MCFM uses conditional flow matching on short transitions to avoid long rollout adjoints. During inference, it employs a Strang-prox scheme that alternates symplectic updates with proximal metric steps, with optional projection for strict energy decay when trusted energy is available.
Result: On a controlled mechanical benchmark, MCFM produces phase portraits closer to ground truth with significantly fewer energy-increase and positive energy rate events compared to unconstrained neural flows, while maintaining comparable terminal distributional fit.
Conclusion: MCFM successfully learns dissipative dynamics that respect physical principles, providing continuous and discrete time guarantees for conservation, monotonic dissipation, and stable rollouts, making it superior to unconstrained neural approaches for long-term stability.
Abstract: Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. In inference, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous and discrete time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.
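The Strang-prox idea can be illustrated on a one-dimensional damped oscillator: a symplectic update handles the conservative part, and a proximal step on the metric (friction) term shrinks momentum so the discrete energy cannot increase at that sub-step. A toy sketch, not the learned MCFM field:

```python
def strang_prox_step(q, p, grad_V, dt, gamma):
    """Alternate a symplectic Euler update for the conservative part with
    a proximal step for the dissipative part. For a quadratic metric term
    gamma*|p|^2/2, the prox is the closed-form shrinkage p/(1 + dt*gamma)."""
    p = p - dt * grad_V(q)                    # conservative half (symplectic Euler)
    q = q + dt * p
    p = p / (1.0 + dt * gamma)                # dissipative half (proximal shrinkage)
    return q, p

grad_V = lambda q: q                          # harmonic potential V(q) = q^2/2
q, p = 1.0, 0.0
for _ in range(1000):
    q, p = strang_prox_step(q, p, grad_V, dt=1e-2, gamma=0.5)
print(q, p)                                   # decays toward the minimum
```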
[357] DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions
Zongyue Li, Xiao Han, Yusong Li, Niklas Strauss, Matthias Schubert
Main category: cs.LG
TL;DR: DAWM is a diffusion-based world model that generates state-reward trajectories conditioned on current state, action, and return-to-go, paired with an inverse dynamics model for action inference, enabling effective offline RL training.
Details
Motivation: Existing diffusion-based world models often don’t generate actions alongside states and rewards, limiting compatibility with standard value-based offline RL algorithms that rely on one-step TD learning. Joint modeling approaches increase training complexity and reduce performance.
Method: Proposes DAWM with modular design: diffusion model generates future state-reward trajectories conditioned on current state, action, and return-to-go, combined with an inverse dynamics model for efficient action inference to produce complete synthetic transitions.
Result: Conservative offline RL algorithms (TD3BC and IQL) benefit significantly from training on augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple D4RL benchmark tasks.
Conclusion: The modular design enables effective and computationally efficient training for one-step TD-based offline RL, demonstrating improved performance over existing approaches.
Abstract: Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose \textbf{DAWM}, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.
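The inverse dynamics component is straightforward to sketch: a small network mapping consecutive states to the action between them, used to label diffusion-generated state trajectories. Dimensions and architecture below are illustrative:

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Infer the action connecting consecutive states, a = f(s_t, s_{t+1})."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

# Labeling a diffusion-generated state trajectory with inferred actions:
idm = InverseDynamicsModel(state_dim=17, action_dim=6)
states = torch.randn(32, 10, 17)              # (batch, horizon, state_dim)
actions = idm(states[:, :-1], states[:, 1:])  # (batch, horizon-1, action_dim)
```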
[358] Learning Dynamics of Deep Learning – Force Analysis of Deep Neural Networks
Yi Ren
Main category: cs.LG
TL;DR: This paper proposes a force analysis-inspired framework to understand how deep learning models learn over time by examining how training examples influence each other during training.
Details
Motivation: To systematically understand deep learning model behaviors by analyzing how one training example affects another during learning, similar to force analysis in physics.
Method: Break down training influence into two components: similarity between examples and strength of updating force. Apply this framework to various learning tasks to analyze model behaviors.
Result: The framework explains non-trivial learning paths of certain examples, effectiveness of LLM finetuning methods, and why simpler patterns are learned more easily. It also uncovers new strategies for improving model training.
Conclusion: While still developing, this force analysis-inspired approach offers a systematic way to interpret model behaviors and provides insights for enhancing training strategies.
Abstract: This thesis explores how deep learning models learn over time, using ideas inspired by force analysis. Specifically, we zoom in on the model’s training procedure to see how one training example affects another during learning, like analyzing how forces move objects. We break this influence into two parts: how similar the two examples are, and how strong the updating force is. This framework helps us understand a wide range of the model’s behaviors in different real systems. For example, it explains why certain examples have non-trivial learning paths, why (and why not) some LLM finetuning methods work, and why simpler, more structured patterns tend to be learned more easily. We apply this approach to various learning tasks and uncover new strategies for improving model training. While the method is still developing, it offers a new way to interpret models’ behaviors systematically.
[359] A Foundation Chemical Language Model for Comprehensive Fragment-Based Drug Discovery
Alexander Ho, Sukyeong Lee, Francis T. F. Tsai
Main category: cs.LG
TL;DR: FragAtlas-62M is a specialized foundation model trained on 62M+ molecules from ZINC-22, achieving unprecedented fragment chemical space coverage with 99.90% chemical validity.
Details
Motivation: To create a comprehensive foundation model for fragment generation that covers the largest fragment dataset to date and produces chemically valid structures.
Method: Built a GPT-2 based model (42.7M parameters) trained on the complete ZINC-22 fragment subset comprising over 62 million molecules.
Result: The model generates 99.90% chemically valid fragments, closely matches training distribution across 12 descriptors and three fingerprint methods (all effect sizes < 0.4), retains 53.6% of known ZINC fragments while producing 22% novel structures.
Conclusion: FragAtlas-62M represents a significant advancement in fragment generation with practical relevance, and the authors release the model, code, data, and documentation to accelerate adoption.
Abstract: We introduce FragAtlas-62M, a specialized foundation model trained on the largest fragment dataset to date. Built on the complete ZINC-22 fragment subset comprising over 62 million molecules, it achieves unprecedented coverage of fragment chemical space. Our GPT-2 based model (42.7M parameters) generates 99.90% chemically valid fragments. Validation across 12 descriptors and three fingerprint methods shows generated fragments closely match the training distribution (all effect sizes < 0.4). The model retains 53.6% of known ZINC fragments while producing 22% novel structures with practical relevance. We release FragAtlas-62M with training code, preprocessed data, documentation, and model weights to accelerate adoption.
[360] Modular Machine Learning with Applications to Genetic Circuit Composition
Jichi Wang, Eduardo D. Sontag, Domitilla Del Vecchio
Main category: cs.LG
TL;DR: A modular learning framework that uses compositional structure knowledge to identify module functions from system data with reduced training requirements, enabling better generalization than structure-agnostic approaches.
Details
Motivation: Many systems have unknown module functions but known composition architecture. Leveraging this structural knowledge can reduce training data needs and enable module identification for designing new systems.
Method: Proposes modular identifiability concept to recover module functions from subset of system data. Uses neural networks that incorporate compositional structure, with theoretical guarantees for genetic circuit-like systems.
Result: Structure-aware neural networks successfully learn module functions and generalize to out-of-distribution inputs, while structure-agnostic networks fail to generalize beyond training distribution.
Conclusion: The framework reduces experimental data requirements and enables module identification, facilitating synthetic biological circuit design and multi-module system development.
Abstract: In several applications, including in synthetic biology, one often has input/output data on a system composed of many modules, and although the modules’ input/output functions and signals may be unknown, knowledge of the composition architecture can significantly reduce the amount of training data required to learn the system’s input/output mapping. Learning the modules’ input/output functions is also necessary for designing new systems from different composition architectures. Here, we propose a modular learning framework, which incorporates prior knowledge of the system’s compositional structure to (a) identify the composing modules’ input/output functions from the system’s input/output data and (b) achieve this by using a reduced amount of data compared to what would be required without knowledge of the compositional structure. To achieve this, we introduce the notion of modular identifiability, which allows recovery of modules’ input/output functions from a subset of the system’s input/output data, and provide theoretical guarantees on a class of systems motivated by genetic circuits. We demonstrate the theory on computational studies showing that a neural network (NNET) that accounts for the compositional structure can learn the composing modules’ input/output functions and predict the system’s output on inputs outside of the training set distribution. By contrast, a neural network that is agnostic of the structure is unable to predict on inputs that fall outside of the training set distribution. By reducing the need for experimental data and allowing module identification, this framework offers the potential to ease the design of synthetic biological circuits and of multi-module systems more generally.
[361] Improved Therapeutic Antibody Reformatting through Multimodal Machine Learning
Jiayi Xin, Aniruddh Raghu, Nick Bhattacharya, Adam Carr, Melanie Montgomery, Hunter Elliott
Main category: cs.LG
TL;DR: Machine learning framework predicts antibody reformatting success using sequence and structural data, outperforming large protein language models in challenging scenarios.
Details
Motivation: Complex therapeutic antibody formats present engineering challenges where individual domain functions aren't guaranteed in novel formats, requiring better prediction tools.
Method: Developed multimodal ML framework incorporating antibody sequence and structural context with realistic deployment evaluation protocols.
Result: Domain-tailored multimodal representations outperform large pretrained protein language models, especially in “new antibody, no data” scenarios with high predictive accuracy.
Conclusion: The framework enables prioritization of promising antibody reformatting candidates and reduces wasted experimental effort in therapeutic antibody design.
Abstract: Modern therapeutic antibody design often involves composing multi-part assemblages of individual functional domains, each of which may be derived from a different source or engineered independently. While these complex formats can expand disease applicability and improve safety, they present a significant engineering challenge: the function and stability of individual domains are not guaranteed in the novel format, and the entire molecule may no longer be synthesizable. To address these challenges, we develop a machine learning framework to predict “reformatting success” – whether converting an antibody from one format to another will succeed or not. Our framework incorporates both antibody sequence and structural context, incorporating an evaluation protocol that reflects realistic deployment scenarios. In experiments on a real-world antibody reformatting dataset, we find the surprising result that large pretrained protein language models (PLMs) fail to outperform simple, domain-tailored, multimodal representations. This is particularly evident in the most difficult evaluation setting, where we test model generalization to a new starting antibody. In this challenging “new antibody, no data” scenario, our best multimodal model achieves high predictive accuracy, enabling prioritization of promising candidates and reducing wasted experimental effort.
[362] Adaptive von Mises-Fisher Likelihood Loss for Supervised Deep Time Series Hashing
Juan Manuel Perez, Kevin Garcia, Brooklyn Berry, Dongjin Song, Yifeng Gao
Main category: cs.LG
TL;DR: Proposes a von Mises-Fisher hashing loss for deep learning-based time series indexing to reduce information loss during binary code discretization by mapping data to hyperspherical space.
Details
Motivation: Traditional deep hashing methods suffer from significant information loss when converting real-valued representations to binary codes, which affects the quality of time series indexing based on semantic meaning.
Method: The method maps data to an M-dimensional hyperspherical space and models each data class as points following distinct von Mises-Fisher distributions, with a loss function designed to maximize separation between these distributions.
Result: Experimental results demonstrate that the proposed method outperforms existing baselines in time series indexing performance.
Conclusion: The von Mises-Fisher hashing approach effectively reduces information loss in deep hashing for time series, providing better semantic separation and improved indexing performance compared to existing methods.
Abstract: Indexing time series by creating compact binary representations is a fundamental task in time series data mining. Recently, deep learning-based hashing methods have proven effective for indexing time series based on semantic meaning rather than just raw similarity. The purpose of deep hashing is to map samples with the same semantic meaning to identical binary hash codes, enabling more efficient search and retrieval. Unlike other supervised representation learning methods, supervised deep hashing requires a discretization step to convert real-valued representations into binary codes, but this can induce significant information loss. In this paper, we propose a von Mises-Fisher (vMF) hashing loss. The proposed deep hashing model maps data to an M-dimensional hyperspherical space to effectively reduce information loss and models each data class as points following distinct vMF distributions. The designed loss aims to maximize the separation between each modeled vMF distribution to provide a better way to maximize the margin between each semantically different data sample. Experimental results show that our method outperforms existing baselines. The implementation is publicly available at https://github.com/jmpq97/vmf-hashing
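With a shared concentration parameter, the class posterior of a vMF mixture on the unit sphere reduces to a softmax over scaled cosine similarities, which gives a compact training loss. A simplified sketch (the paper's adaptive formulation is richer):

```python
import torch
import torch.nn.functional as F

def vmf_hash_loss(z, class_means, labels, kappa=10.0):
    """Embeddings and class means live on the unit hypersphere; with a
    shared concentration kappa, vMF normalizers cancel and the class
    posterior becomes a softmax over kappa-scaled cosine scores, pushing
    classes toward well-separated directions."""
    z = F.normalize(z, dim=-1)
    mu = F.normalize(class_means, dim=-1)
    logits = kappa * z @ mu.t()               # (batch, num_classes)
    return F.cross_entropy(logits, labels)

z = torch.randn(8, 32, requires_grad=True)    # M-dimensional embeddings
means = torch.randn(5, 32)
labels = torch.randint(0, 5, (8,))
vmf_hash_loss(z, means, labels).backward()
# Binary codes then come from discretizing the spherical embedding, e.g. sign(z).
```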
[363] Mamba Modulation: On the Length Generalization of Mamba
Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui
Main category: cs.LG
TL;DR: Mamba state-space models perform poorly on contexts longer than pre-training due to out-of-distribution behavior in state-space dynamics. The paper attributes this to the spectrum of the transition matrix A and proposes spectrum scaling to enable robust long-context generalization.
Details
Motivation: Mamba achieves state-of-the-art results but suffers from significant performance deterioration when applied to contexts longer than pre-training, revealing sensitivity to context length extension that limits its practical utility.
Method: The authors analyze Mamba’s state-space dynamics and identify the spectrum of the transition matrix A as the key limitation. They propose spectrum scaling applied to pre-trained Mamba models to selectively modulate the spectrum of A matrices in each layer.
Result: Spectrum scaling significantly improves performance in long-context settings where simply modulating discretization time steps fails, validating the connection between state convergence behavior and transition matrix spectrum.
Conclusion: The proposed spectrum scaling approach provides a well-founded solution for length generalization in state-space models with structured transition matrices, offering avenues for better long-context performance.
Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba’s performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^N\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
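In the common Mamba parameterization $\mathbf{A} = -\exp(\texttt{A\_log})$ with a diagonal transition matrix, scaling the spectrum amounts to shifting `A_log` by log(alpha). The sketch below assumes the public mamba-ssm attribute layout (`layers`, `mixer.A_log`), which is an assumption and may not match a given checkpoint:

```python
import torch

def scale_mamba_spectrum(model, alpha: float):
    """Scale the magnitude of A's (negative, real) eigenvalues by alpha.
    This changes how fast the discretized state exp(Delta * A) forgets,
    the knob modulated per layer for length generalization. Illustrative
    only; the paper selects scalings per layer rather than one global alpha."""
    with torch.no_grad():
        for layer in model.layers:
            a_log = layer.mixer.A_log         # logged eigenvalue magnitudes
            a_log.add_(torch.log(torch.tensor(alpha)))  # |A| -> alpha * |A|

# scale_mamba_spectrum(pretrained_mamba, alpha=0.9)  # slower decay for long inputs
```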
[364] TIMED: Adversarial and Autoregressive Refinement of Diffusion-Based Time Series Generation
MohammadReza EskandariNasab, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
Main category: cs.LG
TL;DR: TIMED is a unified generative framework for synthesizing high-quality time series data by combining diffusion models, supervisor networks, and adversarial training with MMD loss.
Details
Motivation: Real time series data can be scarce, noisy, or costly to collect, making synthetic generation crucial for applications like forecasting and anomaly detection. Time series generation is challenging as it requires modeling both marginal distributions and temporal dependencies.
Method: TIMED integrates: 1) DDPM for global structure via diffusion processes, 2) supervisor network with teacher forcing for autoregressive dependencies, 3) Wasserstein critic for temporal smoothness, and 4) MMD loss for feature space alignment. All components use masked attention architectures trained jointly.
Result: Experimental results across diverse multivariate time series benchmarks show TIMED generates more realistic and temporally coherent sequences than state-of-the-art generative models.
Conclusion: TIMED provides an effective unified framework that successfully captures both unconditional and conditional aspects of time series data, outperforming existing methods in generating high-quality synthetic time series.
Abstract: Generating high-quality synthetic time series is a fundamental yet challenging task across domains such as forecasting and anomaly detection, where real data can be scarce, noisy, or costly to collect. Unlike static data generation, synthesizing time series requires modeling both the marginal distribution of observations and the conditional temporal dependencies that govern sequential dynamics. We propose TIMED, a unified generative framework that integrates a denoising diffusion probabilistic model (DDPM) to capture global structure via a forward-reverse diffusion process, a supervisor network trained with teacher forcing to learn autoregressive dependencies through next-step prediction, and a Wasserstein critic that provides adversarial feedback to ensure temporal smoothness and fidelity. To further align the real and synthetic distributions in feature space, TIMED incorporates a Maximum Mean Discrepancy (MMD) loss, promoting both diversity and sample quality. All components are built using masked attention architectures optimized for sequence modeling and are trained jointly to effectively capture both unconditional and conditional aspects of time series data. Experimental results across diverse multivariate time series benchmarks demonstrate that TIMED generates more realistic and temporally coherent sequences than state-of-the-art generative models.
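The MMD term is the easiest component to reproduce: with an RBF kernel, it compares mean kernel similarities within and across the real and synthetic feature batches. A compact (biased, V-statistic) sketch:

```python
import torch

def gaussian_mmd(x, y, sigma: float = 1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel between real
    and synthetic feature batches (biased V-statistic form for brevity)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

real = torch.randn(64, 128)                   # features of real sequences
fake = torch.randn(64, 128, requires_grad=True)
gaussian_mmd(real, fake).backward()           # differentiable alignment term
```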
[365] Toward Scalable and Structured Global Station Weather Forecasting
Hongyi Chen, Xiucheng Li, Xinyang Chen, Yun Cheng, Jing Li, Kehai Chen, Liqiang Nie
Main category: cs.LG
TL;DR: A novel Spatial Structured Attention Block for global station weather forecasting that models both local and global spatial correlations through intra-subgraph and inter-subgraph attention mechanisms.
Details
Motivation: Existing time series forecasting methods often ignore or unidirectionally model spatial correlation, which contradicts the intrinsic nature of global weather systems and limits forecast performance.
Method: Proposes a Spatial Structured Attention Block that partitions spatial graphs into subgraphs, uses Intra-subgraph Attention for local spatial correlation, and Inter-subgraph Attention for global message passing. Builds a multiscale spatiotemporal model by progressively expanding subgraph scales.
Result: Achieves performance improvements up to 16.8% over time series forecasting baselines at low running costs.
Conclusion: The proposed model is scalable, produces structured spatial correlation, easy to implement, and significantly outperforms existing methods while maintaining computational efficiency.
Abstract: Global Station Weather Forecasting (GSWF) is a key meteorological research area, critical to energy, aviation, and agriculture. Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the intrinsic nature underlying observations of the global weather system, limiting forecast performance. To address this, we propose a novel Spatial Structured Attention Block in this paper. It partitions the spatial graph into a set of subgraphs and instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention – considering both spatial proximity and global correlation. Building on this block, we develop a multiscale spatiotemporal forecasting model by progressively expanding subgraph scales. The resulting model is both scalable and able to produce structured spatial correlation, and meanwhile, it is easy to implement. The experimental results show that it can achieve performance improvements up to 16.8% over time series forecasting baselines at low running costs.
[366] Symbol-Temporal Consistency Self-supervised Learning for Robust Time Series Classification
Kevin Garcia, Cassandra Garza, Brooklyn Berry, Yifeng Gao
Main category: cs.LG
TL;DR: A self-supervised contrastive learning framework that uses bag-of-symbol representation to handle data distribution shifts in digital health time series data caused by different human behaviors.
Details
Motivation: Time series data in digital health is highly noisy, involves concept drifting, and poses challenges for training generalizable deep learning models, particularly due to data distribution shifts from different human behaviors.
Method: Proposes a self-supervised learning framework that incorporates bag-of-symbol representation, which is known for its insensitivity to data warping, location shifts, and noise in time series data.
Result: The proposed method achieves significantly better performance in scenarios where significant data shifting exists.
Conclusion: Bag-of-symbol representation can effectively guide deep learning to acquire representations resistant to data shifting in digital health time series analysis.
Abstract: The surge in the significance of time series in digital health domains necessitates advanced methodologies for extracting meaningful patterns and representations. Self-supervised contrastive learning has emerged as a promising approach for learning directly from raw data. However, time series data in digital health is known to be highly noisy, inherently involves concept drifting, and poses a challenge for training a generalizable deep learning model. In this paper, we specifically focus on data distribution shift caused by different human behaviors and propose a self-supervised learning framework that is aware of the bag-of-symbol representation. The bag-of-symbol representation is known for its insensitivity to data warping, location shifts, and noise present in time series data, making it potentially pivotal in guiding deep learning to acquire a representation resistant to such data shifting. We demonstrate that the proposed method can achieve significantly better performance where significant data shifting exists.
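A bag-of-symbol representation in the SAX style can be computed in a few lines: z-normalize, piecewise-aggregate, quantize against Gaussian breakpoints, and histogram the symbols. The sketch below is a generic SAX variant, not necessarily the paper's exact pipeline:

```python
import numpy as np
from scipy.stats import norm

def bag_of_symbols(x, n_segments=8, alphabet_size=4):
    """SAX-style bag-of-symbols: the final counts ignore where a pattern
    occurs, which is what buys robustness to warping and location shifts."""
    x = (x - x.mean()) / (x.std() + 1e-8)
    segments = np.array_split(x, n_segments)
    paa = np.array([s.mean() for s in segments])        # piecewise aggregate
    # breakpoints splitting N(0,1) into equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.digitize(paa, breakpoints)
    return np.bincount(symbols, minlength=alphabet_size)

x = np.sin(np.linspace(0, 4 * np.pi, 256)) + 0.1 * np.random.randn(256)
print(bag_of_symbols(x))                      # histogram over 4 symbols
```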
[367] Consistent Estimation of Numerical Distributions under Local Differential Privacy by Wavelet Expansion
Puning Zhao, Zhikun Zhang, Bo Sun, Li Shen, Liang Zhang, Shaowei Wang, Zhe Liu
Main category: cs.LG
TL;DR: A new wavelet expansion method for distribution estimation under local differential privacy that prevents probability mass misplacement in numerical data, outperforming existing solutions.
Details
Motivation: Existing LDP methods work well for categorical data but fail on numerical data due to different evaluation metrics, particularly the need to prevent probability mass from being misplaced far away from ground truth.
Method: Express sample distribution using wavelet expansions, estimate wavelet coefficients under LDP with priority on low-order coefficients to ensure accurate macroscopic estimation.
Result: Theoretical guarantees established; experiments show significant outperformance over existing solutions under Wasserstein and KS distances.
Conclusion: Wavelet expansion approach effectively addresses numerical data distribution estimation under LDP by prioritizing low-order coefficients, preventing probability mass misplacement and achieving superior performance.
Abstract: Distribution estimation under local differential privacy (LDP) is a fundamental and challenging task. Significant progress has been made on categorical data. However, due to different evaluation metrics, these methods do not work well when transferred to numerical data. In particular, we need to prevent the probability mass from being misplaced far away. In this paper, we propose a new approach that expresses the sample distribution using wavelet expansions. The coefficients of the wavelet series are estimated under LDP. Our method prioritizes the estimation of low-order coefficients, in order to ensure accurate estimation at macroscopic level. Therefore, the probability mass is prevented from being misplaced too far away from its ground truth. We establish theoretical guarantees for our methods. Experiments show that our wavelet expansion method significantly outperforms existing solutions under Wasserstein and KS distances.
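To make the mechanism concrete: each user can release a Laplace-perturbed evaluation of a wavelet basis function, and the server averages the reports per coefficient. The Haar sketch below is illustrative; the paper's exact mechanism and privacy-budget allocation may differ:

```python
import numpy as np

def haar_psi(j, k, x):
    """Haar wavelet psi_{j,k}(x) = 2^{j/2} * psi(2^j x - k) on [0, 1]."""
    t = (2 ** j) * x - k
    return (2 ** (j / 2.0)) * (((0 <= t) & (t < 0.5)) * 1.0
                               - ((0.5 <= t) & (t < 1)) * 1.0)

def ldp_coefficient(values, j, k, epsilon):
    """Each user perturbs psi_{j,k}(x_i) with Laplace noise calibrated to
    the basis function's range (sensitivity ~ 2 * 2^{j/2}); low-order
    coefficients (small j) are thus estimated far more accurately,
    matching the low-order-first design."""
    sensitivity = 2 * (2 ** (j / 2.0))
    noisy = haar_psi(j, k, values) + np.random.laplace(
        0, sensitivity / epsilon, size=len(values))
    return noisy.mean()

x = np.random.beta(2, 5, size=100_000)        # private samples in [0, 1]
print(ldp_coefficient(x, j=0, k=0, epsilon=1.0))
```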
[368] Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context
Andrew Wang, Jiashuo Zhang, Michael Oberst
Main category: cs.LG
TL;DR: CXR models perform best on cases with low pre-test probability and worse on high-probability cases, with much diagnostic power potentially coming from inferring clinical context rather than true diagnostic signal.
Details
Motivation: Strong average-case performance on CXR datasets is insufficient to certify clinical utility, requiring more holistic evaluation using clinical context.
Method: Using discharge summaries prior to each CXR to derive pre-test probabilities as proxy for clinical context, then evaluating model performance across different probability levels and on balanced test sets.
Result: Models perform better on low pre-test probability cases and worse on high-probability cases; performance drops sharply on balanced test sets where clinical context shortcuts are removed.
Conclusion: Analysis using clinical context from notes is promising for more rigorous evaluation of clinical vision models, revealing potential overreliance on context rather than true diagnostic capabilities.
Abstract: Public healthcare datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing computer vision models in healthcare. However, strong average-case performance of machine learning (ML) models on these datasets is insufficient to certify their clinical utility. In this paper, we use clinical context, as captured by prior discharge summaries, to provide a more holistic evaluation of current “state-of-the-art” models for the task of CXR diagnosis. Using discharge summaries recorded prior to each CXR, we derive a “prior” or “pre-test” probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. Using this measure, we demonstrate two key findings: First, for several diagnostic labels, CXR models tend to perform best on cases where the pre-test probability is very low, and substantially worse on cases where the pre-test probability is higher. Second, we use pre-test probability to assess whether strong average-case performance reflects true diagnostic signal, rather than an ability to infer the pre-test probability as a shortcut. We find that performance drops sharply on a balanced test set where this shortcut does not exist, which may indicate that much of the apparent diagnostic power derives from inferring this clinical context. We argue that this style of analysis, using context derived from clinical notes, is a promising direction for more rigorous and fine-grained evaluation of clinical vision models.
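The stratified evaluation is easy to reproduce given pre-test probabilities: bucket cases by probability and score the model within each bucket. A sketch with synthetic data in which the model partially "infers" the context (bin edges and data are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_pretest(y_true, y_score, pretest, bins=(0.0, 0.1, 0.5, 1.0)):
    """Score the model within buckets of pre-test probability; buckets
    missing a class are skipped since AUC is undefined there."""
    out = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (pretest >= lo) & (pretest < hi)
        if m.sum() > 1 and len(set(y_true[m])) == 2:
            out[f"[{lo:.1f}, {hi:.1f})"] = roc_auc_score(y_true[m], y_score[m])
    return out

rng = np.random.default_rng(1)
pretest = rng.uniform(0, 1, 2000)
y_true = rng.binomial(1, pretest)             # labels correlated with context
y_score = 0.7 * pretest + 0.3 * rng.uniform(0, 1, 2000)  # "shortcut" model
print(auc_by_pretest(y_true, y_score, pretest))
```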
[369] C${}^2$Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning
Kunlun Xu, Yibo Feng, Jiangmeng Li, Yongsheng Qi, Jiahuan Zhou
Main category: cs.LG
TL;DR: C²Prompt addresses class-wise knowledge coherence issues in federated continual learning by enhancing intra-class consistency and reducing inter-class confusion during prompt communication.
Details
Motivation: Existing prompt-based FCL methods suffer from class-wise knowledge coherence problems, including intra-class distribution gaps across clients and inter-prompt class-wise relevance issues, which exacerbate both spatial and temporal forgetting.
Method: Proposes Class-aware Client Knowledge Interaction (C²Prompt) with two components: Local Class Distribution Compensation (LCDC) to reduce intra-class disparities, and Class-aware Prompt Aggregation (CPA) to selectively strengthen class-relevant knowledge aggregation.
Result: Extensive experiments on multiple FCL benchmarks demonstrate that C²Prompt achieves state-of-the-art performance.
Conclusion: The proposed method effectively addresses class-wise knowledge coherence issues in federated continual learning, leading to improved performance by reducing both intra-class distribution gaps and inter-class knowledge confusion.
Abstract: Federated continual learning (FCL) tackles scenarios of learning from continuously emerging task data across distributed clients, where the key challenge lies in addressing both temporal forgetting over time and spatial forgetting simultaneously. Recently, prompt-based FCL methods have shown advanced performance through task-wise prompt communication. In this study, we underscore that the existing prompt-based FCL methods are prone to insufficient class-wise knowledge coherence between prompts across clients. The class-wise knowledge coherence includes two aspects: (1) intra-class distribution gap across clients, which degrades the learned semantics across prompts, (2) inter-prompt class-wise relevance, which highlights cross-class knowledge confusion. During prompt communication, insufficient class-wise coherence exacerbates knowledge conflicts among new prompts and induces interference with old prompts, intensifying both spatial and temporal forgetting. To address these issues, we propose a novel Class-aware Client Knowledge Interaction (C${}^2$Prompt) method that explicitly enhances class-wise knowledge coherence during prompt communication. Specifically, a local class distribution compensation mechanism (LCDC) is introduced to reduce intra-class distribution disparities across clients, thereby reinforcing intra-class knowledge consistency. Additionally, a class-aware prompt aggregation scheme (CPA) is designed to alleviate inter-class knowledge confusion by selectively strengthening class-relevant knowledge aggregation. Extensive experiments on multiple FCL benchmarks demonstrate that C${}^2$Prompt achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/NeurIPS2025-C2Prompt
[370] A Unified Noise-Curvature View of Loss of Trainability
Gunbir Singh Baveja, Mark Schmidt
Main category: cs.LG
TL;DR: The paper analyzes loss of trainability (LoT) in continual learning with Adam optimizer, finding that single indicators are unreliable predictors. It introduces two complementary criteria for predicting trainability behavior and develops a per-layer scheduler to stabilize training.
Details
Motivation: Loss of trainability occurs when gradient steps no longer yield improvement as tasks evolve, causing accuracy to stall or degrade despite adequate capacity and supervision. The paper aims to understand and address this issue specifically with Adam optimizer.
Method: The authors analyze LoT through an optimization lens and introduce two criteria: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound. These combine into a per-layer predictive threshold. They build a simple per-layer scheduler that keeps each layer’s effective step below a safe limit.
Result: The method stabilizes training and improves accuracy across various techniques including concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay. The learned learning-rate trajectories mirror canonical decay patterns.
Conclusion: The proposed per-layer scheduler based on complementary predictive thresholds effectively addresses loss of trainability in continual learning with Adam, providing stable training and improved performance across different regularization methods.
Abstract: Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve, so accuracy stalls or degrades despite adequate capacity and supervision. We analyze LoT incurred with Adam through an optimization lens and find that single indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy are not reliable predictors. Instead we introduce two complementary criteria: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound that combine into a per-layer predictive threshold that anticipates trainability behavior. Using this threshold, we build a simple per-layer scheduler that keeps each layer's effective step below a safe limit, stabilizing training and improving accuracy across concatenated ReLU (CReLU), Wasserstein regularization, and L2 weight decay, with learned learning-rate trajectories that mirror canonical decay.
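A minimal version of the per-layer scheduler caps each layer's effective step (learning rate times gradient norm) at a safe limit; here the limit is a placeholder constant rather than the paper's combined noise/curvature threshold, and the optimizer is assumed to have one parameter group per layer:

```python
import torch

def cap_layer_steps(optimizer, limit=1e-2):
    """If a layer's estimated step (lr * gradient norm) exceeds the safe
    limit, scale that layer's learning rate back to the limit. `limit`
    stands in for the batch-size-aware noise bound combined with the
    curvature-volatility bound."""
    for group in optimizer.param_groups:
        gnorm = sum(p.grad.norm() ** 2
                    for p in group["params"] if p.grad is not None) ** 0.5
        step = group["lr"] * float(gnorm)
        if step > limit:
            group["lr"] *= limit / step       # scale back to the safe limit

# usage sketch: build Adam with one group per layer, then call after backward()
# optimizer = torch.optim.Adam(
#     [{"params": m.parameters(), "lr": 1e-3} for m in model.children()])
# loss.backward(); cap_layer_steps(optimizer); optimizer.step()
```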
[371] Linear Transformers Implicitly Discover Unified Numerical Algorithms
Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama
Main category: cs.LG
TL;DR: A linear attention transformer trained on masked-block matrix completion tasks implicitly discovers a unified iterative solver that works across different computational regimes without explicit guidance.
Details
Motivation: To explore whether transformers can autonomously discover mathematical algorithms and iterative solvers through in-context learning on matrix completion problems, without being given traditional optimization methods or equations.
Method: Train a linear attention transformer on millions of masked-block matrix completion tasks, where prompts are masked low-rank matrices with missing blocks that could be scalar prediction targets or kernel slices for Nyström extrapolation. The model only sees input-output pairs with mean-squared loss.
Result: The trained transformer implicitly discovers a parameter-free update rule that achieves second-order convergence, reduces distributed iteration complexity, and maintains accuracy with rank-limited attention across three computational regimes.
Conclusion: Transformers trained solely on matrix completion tasks can autonomously discover unified, resource-adaptive iterative solvers, demonstrating the powerful capability of in-context learning to extract mathematical algorithms from data.
Abstract: We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is a masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice of Nyström extrapolation. The model sees only input-output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across three distinct computational regimes (full visibility, rank-limited updates, and distributed computation). We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with rank-limited attention. Thus, a transformer trained solely to patch missing blocks implicitly discovers a unified, resource-adaptive iterative solver spanning prediction, estimation, and Nyström extrapolation, highlighting a powerful capability of in-context learning.
[372] Causal Machine Learning for Surgical Interventions
J. Ben Tamo, Nishant S. Chouhan, Micky C. Nnamdi, Yining Yuan, Shreya S. Chivilkar, Wenqi Shi, Steven W. Hwang, B. Randall Brenn, May D. Wang
Main category: cs.LG
TL;DR: X-MultiTask is a multi-task meta-learning framework for individualized treatment effect (ITE) estimation in surgical decision-making, incorporating inverse probability weighting for causal validity and outperforming baselines on spinal fusion and scoliosis datasets.
Details
Motivation: Surgical decision-making requires understanding causal relationships between patient characteristics, interventions, and outcomes, but traditional statistical methods struggle with complex, heterogeneous data in high-stakes settings like spinal fusion or scoliosis correction.
Method: Developed a multi-task meta-learning framework that models each surgical decision as a distinct task while learning shared representations across tasks, incorporating inverse probability weighting (IPW) into the training objective for causal validity.
Result: Achieved highest average AUC (0.84) in anterior group and competitive performance in posterior group (0.77), with lowest overall ε_NN-PEHE (0.2778) and ε_ATE (0.0763) on spinal fusion dataset. Similarly superior performance on AIS dataset with ε_NN-PEHE = 0.2551 and ε_ATE = 0.0902.
Conclusion: X-MultiTask provides robust, patient-specific causal estimates, offering a powerful tool to advance personalized surgical care and improve patient outcomes.
Abstract: Surgical decision-making is complex and requires understanding causal relationships between patient characteristics, interventions, and outcomes. In high-stakes settings like spinal fusion or scoliosis correction, accurate estimation of individualized treatment effects (ITEs) remains limited due to the reliance on traditional statistical methods that struggle with complex, heterogeneous data. In this study, we develop a multi-task meta-learning framework, X-MultiTask, for ITE estimation that models each surgical decision (e.g., anterior vs. posterior approach, surgery vs. no surgery) as a distinct task while learning shared representations across tasks. To strengthen causal validity, we incorporate the inverse probability weighting (IPW) into the training objective. We evaluate our approach on two datasets: (1) a public spinal fusion dataset (1,017 patients) to assess the effect of anterior vs. posterior approaches on complication severity; and (2) a private AIS dataset (368 patients) to analyze the impact of posterior spinal fusion (PSF) vs. non-surgical management on patient-reported outcomes (PROs). Our model achieves the highest average AUC (0.84) in the anterior group and maintains competitive performance in the posterior group (0.77). It outperforms baselines in treatment effect estimation with the lowest overall $\epsilon_{\text{NN-PEHE}}$ (0.2778) and $\epsilon_{\text{ATE}}$ (0.0763). Similarly, when predicting PROs in AIS, X-MultiTask consistently shows superior performance across all domains, with $\epsilon_{\text{NN-PEHE}}$ = 0.2551 and $\epsilon_{\text{ATE}}$ = 0.0902. By providing robust, patient-specific causal estimates, X-MultiTask offers a powerful tool to advance personalized surgical care and improve patient outcomes. The code is available at https://github.com/Wizaaard/X-MultiTask.
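The IPW ingredient is compact: weight each sample's loss by the inverse probability of the treatment it actually received, with clamping to control variance. A sketch with hypothetical tensors (propensities would come from a separate model):

```python
import torch

def ipw_loss(y_pred, y_true, treatment, propensity):
    """Weight each sample's loss by 1 / P(received treatment | covariates),
    reweighting the treated/untreated groups toward the full population;
    clamping avoids exploding weights."""
    p = torch.where(treatment == 1, propensity, 1 - propensity)
    w = 1.0 / p.clamp(min=0.05)
    per_sample = torch.nn.functional.binary_cross_entropy(
        y_pred, y_true, reduction="none")
    return (w * per_sample).mean()

y_pred = torch.rand(16, requires_grad=True)   # predicted outcome probabilities
y_true = torch.randint(0, 2, (16,)).float()
treat = torch.randint(0, 2, (16,))
prop = torch.rand(16).clamp(0.1, 0.9)         # estimated propensities
print(ipw_loss(y_pred, y_true, treat, prop))
```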
[373] Cuffless Blood Pressure Prediction from Speech Sentences using Deep Learning Methods
Kainat
Main category: cs.LG
TL;DR: Novel noninvasive ABP prediction using BERT-based regression on speech signals, achieving MAE of 13.6 mmHg (SBP) and 12.4 mmHg (DBP) with high R² scores.
Details
Motivation: Traditional cuff-based BP monitoring has limitations like inconsistent results due to white-coat/masked hypertension. Need for comfortable, real-time monitoring methods.
Method: BERT-based regression model analyzing acoustic characteristics of speech signals to correlate voice features with blood pressure levels. Dataset from 95 participants.
Result: Achieved MAE of 13.6 mmHg for SBP and 12.4 mmHg for DBP, with R² scores of 0.99 and 0.94 respectively. Demonstrated effective learning with minimal overfitting.
Conclusion: Speech analysis with deep learning provides viable alternative for BP monitoring, enabling improved telemedicine and remote health monitoring applications.
Abstract: This research presents a novel method for noninvasive arterial blood pressure (ABP) prediction from speech signals, employing a BERT-based regression model. Arterial blood pressure is a vital indicator of cardiovascular health, and accurate monitoring is essential in preventing hypertension-related complications. Traditional cuff-based methods often yield inconsistent results due to factors like white-coat and masked hypertension. Our approach leverages the acoustic characteristics of speech, capturing voice features to establish correlations with blood pressure levels. Utilizing advanced deep learning techniques, we analyze speech signals to extract relevant patterns, enabling real-time monitoring without the discomfort of conventional methods. In our study, we employed a dataset comprising recordings from 95 participants, ensuring diverse representation. The BERT model was fine-tuned on features extracted from speech, leading to impressive performance metrics: a mean absolute error (MAE) of 13.6 mmHg for systolic blood pressure (SBP) and 12.4 mmHg for diastolic blood pressure (DBP), with R² scores of 0.99 and 0.94, respectively. These results indicate the model’s robustness in accurately predicting blood pressure levels. Furthermore, the training and validation loss analysis demonstrates effective learning and minimal overfitting. Our findings suggest that integrating deep learning with speech analysis presents a viable alternative for blood pressure monitoring, paving the way for improved applications in telemedicine and remote health monitoring. By providing a user-friendly and accurate method for blood pressure assessment, this research has significant implications for enhancing patient care and proactive management of cardiovascular health.
[374] Frictional Q-Learning
Hyunwoo Kim, Hyo Kyung Lee
Main category: cs.LG
TL;DR: Frictional Q-learning is a deep RL algorithm for continuous control that uses a friction-inspired constraint to prevent policy drift and extrapolation error by keeping actions close to those in the replay buffer.
Details
Motivation: To address extrapolation error in off-policy RL by drawing an analogy with static friction in classical mechanics, preventing policies from drifting toward unsupported actions.Method: Extends batch-constrained RL by constraining the agent’s action space to encourage behavior similar to the replay buffer while maintaining distance from the orthonormal action space manifold.
Result: The algorithm achieves robust training and competitive performance across standard continuous control benchmarks.
Conclusion: Frictional Q-learning provides an intuitive physical interpretation of extrapolation error while maintaining the simplicity of batch-constrained methods.
Abstract: We draw an analogy between static friction in classical mechanics and extrapolation error in off-policy RL, and use it to formulate a constraint that prevents the policy from drifting toward unsupported actions. In this study, we present Frictional Q-learning, a deep reinforcement learning algorithm for continuous control, which extends batch-constrained reinforcement learning. Our algorithm constrains the agent's action space to encourage behavior similar to that in the replay buffer, while maintaining a distance from the manifold of the orthonormal action space. The constraint preserves the simplicity of batch-constrained methods and provides an intuitive physical interpretation of extrapolation error. Empirically, we demonstrate that our algorithm trains robustly and achieves competitive performance across standard continuous control benchmarks.
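For readers unfamiliar with the batch-constrained starting point, here is a minimal BCQ-style action-selection sketch of the mechanism Frictional Q-learning extends; `q_net` and `behavior_model` are assumed interfaces, and the friction-inspired constraint itself is not reproduced.

```python
import torch

@torch.no_grad()
def batch_constrained_action(q_net, behavior_model, state, n_candidates=10):
    """Pick the best Q-valued action among proposals that resemble the
    replay buffer, which curbs extrapolation to unsupported actions."""
    states = state.unsqueeze(0).expand(n_candidates, -1)
    candidates = behavior_model.sample(states)   # actions near the buffer's support
    q_values = q_net(states, candidates).squeeze(-1)
    return candidates[q_values.argmax()]
```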
[375] Sobolev acceleration for neural networks
Jong Kwon Oh, Hanbaek Lyu, Hwijae Son
Main category: cs.LG
TL;DR: This paper provides the first rigorous theoretical framework proving that Sobolev training accelerates convergence of ReLU networks, with exact formulas for gradients and Hessians under student-teacher framework.
Details
Motivation: Sobolev training has shown empirical benefits in accelerating convergence and improving generalization compared to conventional L² training, but the underlying mechanisms remain only partially understood.Method: The authors develop a theoretical framework using student-teacher setup with Gaussian inputs and shallow ReLU architectures, deriving exact formulas for population gradients and Hessians to quantify improvements in loss landscape conditioning and gradient-flow convergence rates.
Result: The theoretical analysis proves Sobolev training accelerates convergence, with extensive numerical experiments validating the findings and showing benefits extend to modern deep learning tasks.
Conclusion: This work provides rigorous theoretical justification for Sobolev training’s effectiveness, demonstrating improved conditioning and faster convergence rates for ReLU networks, with practical relevance to contemporary deep learning applications.
Abstract: Sobolev training, which integrates target derivatives into the loss functions, has been shown to accelerate convergence and improve generalization compared to conventional $L^2$ training. However, the underlying mechanisms of this training method remain only partially understood. In this work, we present the first rigorous theoretical framework proving that Sobolev training accelerates the convergence of Rectified Linear Unit (ReLU) networks. Under a student-teacher framework with Gaussian inputs and shallow architectures, we derive exact formulas for population gradients and Hessians, and quantify the improvements in conditioning of the loss landscape and gradient-flow convergence rates. Extensive numerical experiments validate our theoretical findings and show that the benefits of Sobolev training extend to modern deep learning tasks.
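To make the objective concrete, a minimal sketch of a Sobolev loss for a scalar-output network, assuming the target derivatives `dy_dx` are available; the weight `lam` and the exact penalty form are illustrative choices, not the paper's.

```python
import torch
import torch.nn.functional as F

def sobolev_loss(model, x, y, dy_dx, lam=1.0):
    """L^2 loss on function values plus an L^2 penalty on input gradients."""
    x = x.clone().requires_grad_(True)
    pred = model(x).squeeze(-1)
    # create_graph=True keeps the derivative term differentiable for training
    grad_pred, = torch.autograd.grad(pred.sum(), x, create_graph=True)
    return F.mse_loss(pred, y) + lam * F.mse_loss(grad_pred, dy_dx)
```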
[376] PPGFlowECG: Latent Rectified Flow with Cross-Modal Encoding for PPG-Guided ECG Generation and Cardiovascular Disease Detection
Xiaocheng Fang, Jiarui Jin, Haoyu Wang, Che Liu, Jieyi Cai, Guangkun Nie, Jun Li, Hongyan Li, Shenda Hong
Main category: cs.LG
TL;DR: PPGFlowECG is a two-stage framework that translates PPG signals into clinically valuable ECG signals using latent space alignment and rectified flow, achieving high fidelity and improved diagnostic reliability for cardiovascular disease detection.
Details
Motivation: ECG is the gold standard for cardiac monitoring but requires specialized equipment, while PPG offers accessible continuous monitoring but lacks definitive diagnostic information. Generative models can bridge this gap by translating PPG to ECG.Method: Two-stage framework with CardioAlign Encoder to align PPG and ECG in shared latent space, followed by latent rectified flow for high-fidelity ECG generation. Tested on MCMED dataset with over 10 million paired PPG-ECG samples.
Result: Method effectively performs PPG-to-ECG translation and cardiovascular disease detection. Cardiologist evaluations confirm synthesized ECGs achieve high fidelity and improve diagnostic reliability.
Conclusion: PPGFlowECG shows strong potential for real-world cardiovascular screening by enabling accessible continuous monitoring with clinically valuable ECG information.
Abstract: In clinical practice, electrocardiography (ECG) remains the gold standard for cardiac monitoring, providing crucial insights for diagnosing a wide range of cardiovascular diseases (CVDs). However, its reliance on specialized equipment and trained personnel limits feasibility for continuous routine monitoring. Photoplethysmography (PPG) offers accessible, continuous monitoring but lacks definitive electrophysiological information, preventing conclusive diagnosis. Generative models present a promising approach to translate PPG into clinically valuable ECG signals, yet current methods face substantial challenges, including the misalignment of physiological semantics in generative models and the complexity of modeling in high-dimensional signals. To this end, we propose PPGFlowECG, a two-stage framework that aligns PPG and ECG in a shared latent space via the CardioAlign Encoder and employs latent rectified flow to generate ECGs with high fidelity and interpretability. To the best of our knowledge, this is the first study to experiment on MCMED, a newly released clinical-grade dataset comprising over 10 million paired PPG-ECG samples from more than 118,000 emergency department visits with expert-labeled cardiovascular disease annotations. Results demonstrate the effectiveness of our method for PPG-to-ECG translation and cardiovascular disease detection. Moreover, cardiologist-led evaluations confirm that the synthesized ECGs achieve high fidelity and improve diagnostic reliability, underscoring our method’s potential for real-world cardiovascular screening.
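The second stage follows the standard rectified-flow recipe; below is a hedged sketch of what the conditional training objective in latent space could look like, with `velocity_net` and its conditioning interface assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def latent_rectified_flow_loss(velocity_net, z_ecg, z_ppg):
    """Regress the constant velocity of the straight noise-to-data path,
    conditioned on the aligned PPG latent (interfaces are assumptions)."""
    noise = torch.randn_like(z_ecg)
    t = torch.rand(z_ecg.size(0), *([1] * (z_ecg.dim() - 1)), device=z_ecg.device)
    z_t = (1 - t) * noise + t * z_ecg        # linear interpolation at time t
    target = z_ecg - noise                   # rectified-flow velocity target
    return F.mse_loss(velocity_net(z_t, t, z_ppg), target)
```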
[377] Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference
Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui
Main category: cs.LG
TL;DR: Tanbr is a tree-structured adaptive neural bandit router that enables efficient online MoE inference by estimating task distributions and performing task-aware expert merging without explicit task tags.
Details
Motivation: Deploying Sparse Mixture of Experts (SMoE) for online inference is challenging due to large model size, complex expert routing, and lack of task information in resource-constrained edge networks.Method: Tanbr estimates task distribution from historical data, uses binary tree to partition merging weight space, and applies neural bandit to learn non-linear mapping from merging weights to model performance for optimal expert merging.
Result: Tanbr achieves sublinear regret bound of O(√T log(T)), reduces inference latency by at least 45%, memory usage by up to 25%, while maintaining high accuracy compared to state-of-the-art methods.
Conclusion: The proposed Tanbr router provides an efficient and reliable solution for online MoE inference by enabling task-aware expert merging without explicit task information, making SMoE deployment practical for edge networks.
Abstract: Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for online inference remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during online inference, task information is often unavailable, making task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, Tanbr, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, Tanbr estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, Tanbr employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weights to model performance and to decide the optimal expert merging. We prove that Tanbr achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T} \log(T))$ over $T$ rounds, despite operating over a continuous decision space, matching the regret bounds of existing methods. Extensive experiments show that Tanbr reduces inference latency by at least 45% and memory usage by up to 25%, while maintaining high accuracy compared to many state-of-the-art methods.
[378] RDAR: Reward-Driven Agent Relevance Estimation for Autonomous Driving
Carlo Bosio, Greg Woelki, Noureldin Hendy, Nicholas Roy, Byungsoo Kim
Main category: cs.LG
TL;DR: RDAR is a method to learn per-agent relevance in autonomous driving by identifying which agents can be excluded from input to a pre-trained behavior model, reducing computational complexity while maintaining performance.
Details
Motivation: Human drivers focus on only a few relevant agents at a time, while current autonomous systems process all agents regardless of relevance, leading to inefficient quadratic attention mechanisms.Method: Formulate agent masking as a Markov Decision Process where actions are binary masks indicating agent selection, learning to exclude irrelevant agents from a pre-trained behavior model.
Result: RDAR achieves comparable driving performance (progress and safety metrics) while processing significantly fewer agents compared to state-of-the-art behavior models.
Conclusion: The proposed RDAR method successfully learns accurate agent relevance measures, enabling efficient autonomous driving by focusing computational resources on truly influential agents.
Abstract: Human drivers focus only on a handful of agents at any one time. On the other hand, autonomous driving systems process complex scenes with numerous agents, regardless of whether they are pedestrians on a crosswalk or vehicles parked on the side of the road. While attention mechanisms offer an implicit way to reduce the input to the elements that affect decisions, existing attention mechanisms for capturing agent interactions are quadratic and generally computationally expensive. We propose RDAR, a strategy to learn per-agent relevance – how much each agent influences the behavior of the controlled vehicle – by identifying which agents can be excluded from the input to a pre-trained behavior model. We formulate the masking procedure as a Markov Decision Process where the action consists of a binary mask indicating agent selection. We evaluate RDAR on a large-scale driving dataset, and demonstrate its ability to learn an accurate numerical measure of relevance by achieving comparable driving performance, in terms of overall progress and safety, while processing significantly fewer agents compared to a state-of-the-art behavior model.
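Reading the abstract's MDP formulation literally, the return for a candidate mask might be scored roughly as below; every name here is hypothetical and the actual reward design is not specified in the summary.

```python
import numpy as np

def mask_return(behavior_model, agents, mask, driving_score, cost=0.01):
    """Hypothetical reward for a binary agent-selection mask: driving
    performance of the frozen behavior model on the kept agents, minus a
    small cost per attended agent (all names here are illustrative)."""
    kept = [a for a, m in zip(agents, mask) if m]
    plan = behavior_model.plan(kept)             # pre-trained model, not updated
    return driving_score(plan) - cost * int(np.sum(mask))
```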
[379] VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang
Main category: cs.LG
TL;DR: VCRL is a curriculum reinforcement learning framework that dynamically controls training sample difficulty based on reward variance, addressing limitations of existing rollout-based RL methods in LLM mathematical reasoning.
Details
Motivation: Existing rollout-based RL methods fail to consider LLMs' varying learning abilities for samples of different difficulty levels, which contradicts human cognitive processes that progress from easy to difficult in mathematical reasoning.Method: VCRL uses the variance of rollout group rewards as an indicator of sample difficulty, with samples that are too easy or too difficult having lower variance, and moderately difficult samples having higher variance. The framework dynamically adjusts training sample difficulty based on this variance metric.
Result: Experiments on five mathematical benchmarks and two models show that VCRL outperforms current LLM RL baselines.
Conclusion: VCRL provides an effective curriculum learning approach for LLM mathematical reasoning by dynamically controlling sample difficulty based on reward variance, aligning better with human cognitive processes.
Abstract: Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs' learning ability for samples of different difficulty levels, which runs contrary to the human cognitive process of working from easy to difficult on mathematical reasoning tasks. Intuitively, we find that the variance of the rollout group's reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have lower variance, while samples of moderate difficulty have higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over current LLM RL baselines.
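Because difficulty is pinned to reward variance, the selection rule is easy to sketch: with binary verifier rewards the group variance is p(1-p), which peaks at a 50% solve rate. The names below are illustrative.

```python
import numpy as np

def select_moderate_prompts(prompts, group_rewards, k):
    """Keep the k prompts whose rollout-group rewards vary the most.

    group_rewards[i] holds the rewards of all rollouts for prompts[i];
    for 0/1 rewards, the variance p*(1-p) is maximal at moderate difficulty.
    """
    variances = np.array([np.var(r) for r in group_rewards])
    top = np.argsort(variances)[-k:]
    return [prompts[i] for i in top]
```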
[380] An Efficient Conditional Score-based Filter for High Dimensional Nonlinear Filtering Problems
Zhijun Zeng, Weiye Gan, Junqing Chen, Zuoqiang Shi
Main category: cs.LG
TL;DR: CSF is a novel conditional score-based filtering algorithm that enables efficient posterior sampling without retraining by using a set-transformer encoder and conditional diffusion model, separating prior modeling and posterior sampling into offline/online stages.
Details
Motivation: High-dimensional nonlinear filtering remains challenging, and existing score-based diffusion methods require impractical repeated retraining to track evolving priors in high dimensions.Method: Proposes Conditional Score-based Filter (CSF) using a set-transformer encoder and conditional diffusion model to decouple prior modeling (offline) from posterior sampling (online), eliminating the need for retraining.
Result: Extensive experiments on benchmark problems demonstrate CSF achieves superior accuracy, robustness, and efficiency across diverse nonlinear filtering scenarios.
Conclusion: CSF provides a scalable solution for score-based filtering that works effectively across various nonlinear systems without the impractical retraining requirements of previous methods.
Abstract: In many engineering and applied science domains, high-dimensional nonlinear filtering is still a challenging problem. Recent advances in score-based diffusion models offer a promising alternative for posterior sampling but require repeated retraining to track evolving priors, which is impractical in high dimensions. In this work, we propose the Conditional Score-based Filter (CSF), a novel algorithm that leverages a set-transformer encoder and a conditional diffusion model to achieve efficient and accurate posterior sampling without retraining. By decoupling prior modeling and posterior sampling into offline and online stages, CSF enables scalable score-based filtering across diverse nonlinear systems. Extensive experiments on benchmark problems show that CSF achieves superior accuracy, robustness, and efficiency across diverse nonlinear filtering scenarios.
[381] On the Rate of Convergence of Kolmogorov-Arnold Network Regression Estimators
Wei Liu, Eleni Chatzi, Zhilu Lai
Main category: cs.LG
TL;DR: This paper establishes theoretical convergence guarantees for Kolmogorov-Arnold Networks (KANs) using B-splines, proving they achieve minimax-optimal convergence rates for Sobolev spaces and providing guidelines for optimal knot selection.
Details
Motivation: To provide a solid theoretical foundation for using KANs in nonparametric regression by establishing their convergence properties and demonstrating their potential as structured, interpretable alternatives to existing methods.Method: Theoretical analysis of KANs with B-spline representations, proving convergence rates for both additive and hybrid additive-multiplicative KANs in Sobolev spaces, with simulation studies to validate the theoretical results.
Result: Proved that KANs achieve the minimax-optimal convergence rate O(n^{-2r/(2r+1)}) for functions in Sobolev spaces of smoothness r, and derived optimal knot selection guidelines for B-splines.
Conclusion: The theoretical results provide a strong foundation for using KANs in nonparametric regression, highlighting their potential as structured and interpretable alternatives to existing methods, with confirmed convergence rates through simulations.
Abstract: Kolmogorov-Arnold Networks (KANs) offer a structured and interpretable framework for multivariate function approximation by composing univariate transformations through additive or multiplicative aggregation. This paper establishes theoretical convergence guarantees for KANs when the univariate components are represented by B-splines. We prove that both additive and hybrid additive-multiplicative KANs attain the minimax-optimal convergence rate $O(n^{-2r/(2r+1)})$ for functions in Sobolev spaces of smoothness $r$. We further derive guidelines for selecting the optimal number of knots in the B-splines. The theory is supported by simulation studies that confirm the predicted convergence rates. These results provide a theoretical foundation for using KANs in nonparametric regression and highlight their potential as a structured alternative to existing methods.
[382] BoreaRL: A Multi-Objective Reinforcement Learning Environment for Climate-Adaptive Boreal Forest Management
Kevin Bradley Dsouza, Enoch Ofosu, Daniel Chukwuemeka Amaogu, Jérôme Pigeon, Richard Boudreault, Pooneh Maghoul, Juan Moreno-Cruz, Yuri Leonenko
Main category: cs.LG
TL;DR: BoreaRL is the first multi-objective reinforcement learning environment for climate-adaptive boreal forest management, revealing that carbon objectives are easier to optimize than permafrost preservation objectives, with current MORL methods struggling to achieve robust multi-objective policies.
Details
Motivation: Boreal forests store 30-40% of terrestrial carbon in climate-vulnerable permafrost soils, making their management critical for climate mitigation, but current tools cannot adequately address the complex trade-offs between carbon sequestration and permafrost preservation.Method: The authors introduce BoreaRL, a physically-grounded simulator of coupled energy, carbon, and water fluxes, supporting two training paradigms: site-specific mode and generalist mode. They evaluate multi-objective RL algorithms and analyze learned management strategies.
Result: Carbon objectives are significantly easier to optimize than thaw objectives, with thaw-focused policies showing minimal learning progress. In generalist settings, standard preference-conditioned approaches fail, while curriculum learning achieves superior performance. Carbon-focused policies favor aggressive high-density coniferous stands, while effective multi-objective policies balance species composition and density.
Conclusion: Robust climate-adaptive forest management remains challenging for current MORL methods, establishing BoreaRL as a valuable benchmark. The framework is open-sourced to accelerate research in multi-objective RL for climate applications.
Abstract: Boreal forests store 30-40% of terrestrial carbon, much in climate-vulnerable permafrost soils, making their management critical for climate mitigation. However, optimizing forest management for both carbon sequestration and permafrost preservation presents complex trade-offs that current tools cannot adequately address. We introduce BoreaRL, the first multi-objective reinforcement learning environment for climate-adaptive boreal forest management, featuring a physically-grounded simulator of coupled energy, carbon, and water fluxes. BoreaRL supports two training paradigms: site-specific mode for controlled studies and generalist mode for learning robust policies under environmental stochasticity. Through evaluation of multi-objective RL algorithms, we reveal a fundamental asymmetry in learning difficulty: carbon objectives are significantly easier to optimize than thaw (permafrost preservation) objectives, with thaw-focused policies showing minimal learning progress across both paradigms. In generalist settings, standard preference-conditioned approaches fail entirely, while a naive curriculum learning approach achieves superior performance by strategically selecting training episodes. Analysis of learned strategies reveals distinct management philosophies, where carbon-focused policies favor aggressive high-density coniferous stands, while effective multi-objective policies balance species composition and density to protect permafrost while maintaining carbon gains. Our results demonstrate that robust climate-adaptive forest management remains challenging for current MORL methods, establishing BoreaRL as a valuable benchmark for developing more effective approaches. We open-source BoreaRL to accelerate research in multi-objective RL for climate applications.
[383] Analyzing Generalization in Pre-Trained Symbolic Regression
Henrik Voigt, Paul Kahlmeyer, Kai Lawonn, Michael Habeck, Joachim Giesen
Main category: cs.LG
TL;DR: Transformer-based symbolic regression models show strong in-distribution performance but fail to generalize to out-of-distribution problems, limiting their practical utility.
Details
Motivation: To systematically evaluate the generalization capabilities of pre-trained transformer models for symbolic regression beyond their training data distribution.Method: Conducted empirical study testing state-of-the-art transformer-based symbolic regression approaches on both in-distribution and out-of-distribution scenarios.
Result: Significant performance degradation in out-of-distribution settings despite good in-distribution performance, revealing a critical generalization gap.
Conclusion: The generalization limitation poses a major barrier for real-world application of pre-trained symbolic regression models.
Abstract: Symbolic regression algorithms search a space of mathematical expressions for formulas that explain given data. Transformer-based models have emerged as a promising, scalable approach, shifting the expensive combinatorial search to a large-scale pre-training phase. However, the success of these models is critically dependent on their pre-training data. Their ability to generalize to problems outside of this pre-training distribution remains largely unexplored. In this work, we conduct a systematic empirical study to evaluate the generalization capabilities of pre-trained, transformer-based symbolic regression models. We rigorously test performance both within the pre-training distribution and on a series of out-of-distribution challenges for several state-of-the-art approaches. Our findings reveal a significant dichotomy: while pre-trained models perform well in-distribution, performance consistently degrades in out-of-distribution scenarios. We conclude that this generalization gap is a critical barrier for practitioners, as it severely limits the practical use of pre-trained approaches for real-world applications.
[384] Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach
Samir Brahim Belhaouari, Yunis Carreon Kahalan, Humaira Shaffique, Ismael Belhaouari, Ashhadul Islam
Main category: cs.LG
TL;DR: A method to differentiate between boundary-critical and core-redundant data in imbalanced classification, with boundary oversampling improving F1 scores by up to 10% and core-aware reduction compressing datasets by 90% while maintaining accuracy.
Details
Motivation: Machine learning models struggle with unbalanced classification due to inability to distinguish between critical boundary instances and redundant core samples, hindering effectiveness.Method: Proposed adaptive resampling method that systematically identifies and differentiates between boundary and core data, using boundary oversampling and core-aware reduction techniques.
Result: Boundary oversampling improved F1 score by up to 10% on 96% of datasets; core-aware reduction compressed datasets by 90% while preserving accuracy, making it 10x more powerful than original data.
Conclusion: The approach enables data-efficient learning across text, multimodal, and self-supervised scenarios, offering faster convergence, improved generalization, and significant computational savings for AI advancements.
Abstract: The effectiveness of machine learning models, particularly in unbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10% on 96% of the datasets, whereas our core-aware reduction method compresses datasets up to 90% while preserving their accuracy, making it 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion, driving the next generation of AI advancements. Our code is available as a Python package at https://pypi.org/project/adaptive-resampling/ .
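The abstract does not specify how boundary and core points are identified; one common heuristic (not necessarily the paper's criterion) scores each point by the label disagreement among its nearest neighbours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def boundary_scores(X, y, k=5):
    """Fraction of a point's k nearest neighbours carrying a different label.

    High scores flag boundary instances (oversampling candidates); scores
    near zero flag redundant core samples (reduction candidates).
    y is expected as a NumPy array of class labels.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)             # column 0 is the point itself
    return (y[idx[:, 1:]] != y[:, None]).mean(axis=1)
```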
[385] Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials
Shi Yin, Zujian Dai, Xinyang Pan, Lixin He
Main category: cs.LG
TL;DR: NextHAM is a neural E(3)-symmetry method for efficient and generalizable materials electronic-structure Hamiltonian prediction that uses zeroth-step Hamiltonians as informative descriptors and correction terms, with a Transformer architecture and dual-space training objective to prevent error amplification.
Details
Motivation: Deep learning methods for Hamiltonian prediction face challenges in generalization due to diverse atomic types, structural patterns, and high-dimensional complexity of Hamiltonians. Traditional DFT methods are computationally expensive.Method: Proposes NextHAM with three key components: 1) zeroth-step Hamiltonians as input descriptors and output correction targets, 2) neural Transformer with strict E(3)-symmetry, 3) dual-space training objective for real and reciprocal space accuracy to prevent ghost states.
Result: Experimental results on the Materials-HAM-SOC benchmark (17,000 materials spanning 68 elements) demonstrate NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.
Conclusion: NextHAM advances universal deep learning for Hamiltonian prediction through methodological innovations in input-output mapping simplification, symmetry-preserving architecture, and robust training objectives, validated on a comprehensive benchmark dataset.
Abstract: Deep learning methods for electronic-structure Hamiltonian prediction have offered significant computational efficiency advantages over traditional DFT methods, yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to generalization performance. In this work, we contribute on both the methodology and dataset sides to advance a universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose NextHAM, a neural E(3)-symmetry and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed from the initial charge density of DFT, as informative descriptors of the neural regression model at the input level and initial estimates of the target Hamiltonian at the output level, so that the regression model directly predicts the correction terms to the target ground truths, thereby significantly simplifying the input-output mapping for learning. Second, we present a neural Transformer architecture with strict E(3) symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of “ghost states” caused by the large condition number of the overlap matrix. On the dataset side, we curate a large, high-quality, broad-coverage benchmark, namely Materials-HAM-SOC, comprising 17,000 material structures spanning 68 elements from six rows of the periodic table and explicitly incorporating SOC effects. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.
[386] MCGrad: Multicalibration at Web Scale
Lorenzo Perini, Daniel Haimovich, Fridolin Linder, Niek Tax, Dima Karamshuk, Milan Vojnovic, Nastaran Okati, Pavlos Athanasios Apostolopoulos
Main category: cs.LG
TL;DR: MCGrad is a novel multicalibration algorithm that addresses limitations of existing methods by not requiring manual subgroup specification, being scalable, and improving rather than harming other ML performance metrics.
Details
Motivation: Existing multicalibration methods have limited industry adoption due to requirements for manual subgroup specification, lack of scalability, and potential negative impact on other model performance metrics like log loss and PRAUC.Method: MCGrad is a scalable multicalibration algorithm that automatically handles subgroup calibration without requiring explicit specification of protected groups, while maintaining or improving other ML evaluation metrics.
Result: MCGrad has been successfully deployed in production at Meta and is now part of hundreds of production models, with positive results demonstrated both in production deployments and on public datasets.
Conclusion: MCGrad represents a practical multicalibration solution that overcomes key barriers to industry adoption, providing automated subgroup calibration at scale while preserving or enhancing overall model performance.
Abstract: We propose MCGrad, a novel and scalable multicalibration algorithm. Multicalibration - calibration in sub-groups of the data - is an important property for the performance of machine learning-based systems. Existing multicalibration methods have thus far received limited traction in industry. We argue that this is because existing methods (1) require such subgroups to be manually specified, which ML practitioners often struggle with, (2) are not scalable, or (3) may harm other notions of model performance such as log loss and Area Under the Precision-Recall Curve (PRAUC). MCGrad does not require explicit specification of protected groups, is scalable, and often improves other ML evaluation metrics instead of harming them. MCGrad has been in production at Meta, and is now part of hundreds of production models. We present results from these deployments as well as results on public datasets.
[387] Towards Self-Supervised Foundation Models for Critical Care Time Series
Katja Naasunnguaq Jagd, Rachael DeVries, Ole Winther
Main category: cs.LG
TL;DR: Introduces a Bi-Axial Transformer foundation model for critical care time series that outperforms supervised baselines in mortality prediction, especially for small datasets (<5,000 samples).
Details
Motivation: Domain-specific foundation models for healthcare are expanding rapidly, but critical care time series models remain underexplored due to limited dataset size and availability.Method: Uses Bi-Axial Transformer (BAT) architecture trained on pooled electronic health record datasets via self-supervised pre-training, then fine-tuned for mortality prediction on distinct datasets.
Result: The model demonstrates effective transfer learning and outperforms supervised baselines, particularly for small datasets with less than 5,000 samples.
Conclusion: Self-supervised foundation models show potential for supporting generalizable and robust clinical applications in resource-limited critical care settings.
Abstract: Domain-specific foundation models for healthcare have expanded rapidly in recent years, yet foundation models for critical care time series remain relatively underexplored due to the limited size and availability of datasets. In this work, we introduce an early-stage pre-trained foundation model for critical care time series based on the Bi-Axial Transformer (BAT), trained on pooled electronic health record datasets. We demonstrate effective transfer learning by fine-tuning the model on a dataset distinct from the training sources for mortality prediction, where it outperforms supervised baselines, particularly for small datasets ($<5,000$). These contributions highlight the potential of self-supervised foundation models for critical care time series to support generalizable and robust clinical applications in resource-limited settings.
[388] PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning
Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong
Main category: cs.LG
TL;DR: PromptCoT 2.0 is a scalable framework that uses an EM loop to iteratively refine rationales for generating harder and more diverse training problems, enabling self-play and supervised fine-tuning that achieves state-of-the-art results on reasoning benchmarks.
Details
Motivation: Current LLM training faces a bottleneck in high-quality reasoning problems - human-curated datasets are limited and costly, while synthetic corpora are often too easy or narrow.Method: Uses expectation-maximization (EM) loop to iteratively refine rationales that guide prompt construction, producing harder and more diverse problems. Supports two training regimes: Self-Play (autonomous improvement via verifiable feedback) and Supervised Fine-Tuning (weaker models learn from teacher-distilled traces).
Result: Achieved SOTA results: Qwen3-30B improved by +4.4 to +6.1 on AIME/HMMT/LiveCodeBench, +35 Elo on Codeforces. Qwen2.5-7B trained solely on synthetic prompts reached 73.1% (AIME 24) and 65.6% (AIME 25), surpassing training on human/hybrid data.
Conclusion: Prompt synthesis establishes a new axis for scaling reasoning capabilities, with PromptCoT 2.0 serving as a scalable foundation for future open-source models.
Abstract: Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
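Taken at face value, the EM loop alternates between regenerating problems from rationales and refining the rationales against feedback; a schematic sketch, with every callable an assumed interface rather than the released implementation:

```python
def em_prompt_synthesis(concepts, rationalizer, generator, scorer, rounds=3):
    """Alternate between problem generation (given rationales) and rationale
    refinement (given difficulty/diversity feedback). Purely illustrative."""
    rationales = [rationalizer.draft(c) for c in concepts]
    problems = []
    for _ in range(rounds):
        problems = [generator(c, r) for c, r in zip(concepts, rationales)]
        feedback = [scorer(p) for p in problems]        # e.g., difficulty estimate
        rationales = [rationalizer.refine(c, r, f)
                      for c, r, f in zip(concepts, rationales, feedback)]
    return problems
```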
[389] Pure Exploration via Frank-Wolfe Self-Play
Xinyu Liu, Chao Qin, Wei You
Main category: cs.LG
TL;DR: The paper presents Frank-Wolfe Self-Play (FWSP), a projection-free method for pure exploration in structured stochastic multi-armed bandits, addressing challenges like nonunique optima and nonsmoothness through differential-inclusion analysis.
Details
Motivation: To efficiently identify the correct hypothesis in structured bandit problems, addressing pathological issues like nonunique optima and optimal designs with zero mass on the best arm that complicate algorithm design.Method: Reformulates the maximin optimization as a concave-convex saddle-point problem using mixed strategies, leading to FWSP with one-hot updates. Uses differential-inclusion arguments and Lyapunov analysis for convergence proofs, and proposes a posterior sampling-based learning algorithm.
Result: Proves convergence of the game value for best-arm identification in linear bandits, showing vanishing duality gap and uniform global convergence to optimal value through continuous-time analysis.
Conclusion: FWSP provides a theoretically grounded, tuning-free approach for structured bandit exploration that handles complex structural constraints and achieves optimal performance despite pathological challenges.
Abstract: We study pure exploration in structured stochastic multi-armed bandits, aiming to efficiently identify the correct hypothesis from a finite set of alternatives. For a broad class of tasks, asymptotic analyses reduce to a maximin optimization that admits a two-player zero-sum game interpretation between an experimenter and a skeptic: the experimenter allocates measurements to rule out alternatives while the skeptic proposes alternatives. We reformulate the game by allowing the skeptic to adopt a mixed strategy, yielding a concave-convex saddle-point problem. This viewpoint leads to Frank-Wolfe Self-Play (FWSP): a projection-free, regularization-free, tuning-free method whose one-hot updates on both sides match the bandit sampling paradigm. However, structural constraints introduce sharp pathologies that complicate algorithm design and analysis: our linear-bandit case study exhibits nonunique optima, optimal designs with zero mass on the best arm, bilinear objectives, and nonsmoothness at the boundary. We address these challenges via a differential-inclusion argument, proving convergence of the game value for best-arm identification in linear bandits. Our analysis proceeds through a continuous-time limit: a differential inclusion with a Lyapunov function that decays exponentially, implying a vanishing duality gap and convergence to the optimal value. Although Lyapunov analysis requires differentiability of the objective, which is not guaranteed on the boundary, we show that along continuous trajectories the algorithm steers away from pathological nonsmooth points and achieves uniform global convergence to the optimal game value. We then embed the discrete-time updates into a perturbed flow and show that the discrete game value also converges. Building on FWSP, we further propose a learning algorithm based on posterior sampling. Numerical experiments demonstrate a vanishing duality gap.
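On the experimenter's side, the one-hot updates come from the fact that Frank-Wolfe's linear minimization oracle over the probability simplex returns a vertex, i.e. a single arm. A generic simplex step, with the game-specific gradient abstracted away:

```python
import numpy as np

def frank_wolfe_simplex_step(w, grad, t):
    """One projection-free update on the simplex: move toward the vertex
    minimizing the linearized objective, with the classic 2/(t+2) step."""
    vertex = np.zeros_like(w)
    vertex[np.argmin(grad)] = 1.0          # one-hot: sample a single arm
    gamma = 2.0 / (t + 2.0)
    return (1.0 - gamma) * w + gamma * vertex
```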
[390] Latent Iterative Refinement Flow: A Geometric-Constrained Approach for Few-Shot Generation
Songtao Li, Zhenyu Liao, Tianqi Hou, Ting Gao
Main category: cs.LG
TL;DR: LIRF is a novel few-shot generation method that reframes the problem as progressive densification of geometrically structured manifolds, using a manifold-preserving latent space and iterative correction to overcome overfitting and mode collapse.
Details
Motivation: Existing few-shot generation methods suffer from overfitting, mode collapse, and inherit biases from large models while neglecting latent space geometry. LIRF aims to address these limitations by focusing on geometric structure preservation.Method: LIRF establishes a stable latent space using an autoencoder with manifold-preservation loss, then employs an iterative generate-correct-augment cycle with a contractive correction operator that pulls samples toward the data manifold while preserving diversity.
Result: The method demonstrates predictable convergence (proven by Convergence Theorem) and scalability to high-resolution image generation (AFHQ-Cat). Ablation studies confirm both the manifold-preserving latent space and correction mechanism are critical.
Conclusion: LIRF provides a theoretically grounded and highly effective solution for data-scarce generative modeling that overcomes key limitations of existing approaches.
Abstract: Few-shot generation, the synthesis of high-quality and diverse samples from limited training data, remains a significant challenge in generative modeling. Existing methods trained from scratch often fail to overcome overfitting and mode collapse, and fine-tuning large models can inherit biases while neglecting the crucial geometric structure of the latent space. To address these limitations, we introduce Latent Iterative Refinement Flow (LIRF), a novel approach that reframes few-shot generation as the progressive densification of a geometrically structured manifold. LIRF establishes a stable latent space using an autoencoder trained with our novel manifold-preservation loss $L_{\text{manifold}}$. This loss ensures that the latent space maintains the geometric and semantic correspondence of the input data. Building on this, we propose an iterative generate-correct-augment cycle. Within this cycle, candidate samples are refined by a geometric correction operator, a provably contractive mapping that pulls samples toward the data manifold while preserving diversity. We also provide a Convergence Theorem demonstrating a predictable decrease in the Hausdorff distance between the generated and true data manifolds. We further demonstrate the framework's scalability by generating coherent, high-resolution images on AFHQ-Cat. Ablation studies confirm that both the manifold-preserving latent space and the contractive correction mechanism are critical components of this success. Ultimately, LIRF provides a solution for data-scarce generative modeling that is not only theoretically grounded but also highly effective in practice.
[391] On the Fragility of Contribution Score Computation in Federated Learning
Balazs Pejo, Marcell Frank, Krisztian Varga, Peter Veliczky
Main category: cs.LG
TL;DR: This paper reveals that contribution evaluation in federated learning is fragile and can be distorted by both architectural choices (different aggregation methods) and intentional manipulation (poisoning attacks).
Details
Motivation: To investigate the vulnerabilities in contribution evaluation mechanisms that are crucial for ensuring fairness and incentivizing participation in federated learning systems.Method: The study explores two perspectives: architectural sensitivity (impact of different model aggregation methods) and intentional manipulation (poisoning attacks). Extensive experiments were conducted across diverse datasets and model architectures using the Flower framework.
Result: Both the choice of aggregation method and the presence of attackers significantly distort contribution scores. Advanced aggregation techniques designed for unreliable clients can unintentionally alter scores, while malicious participants can strategically manipulate updates to inflate their own contributions.
Conclusion: The findings highlight a critical need for more robust contribution evaluation schemes in federated learning, as current methods are susceptible to significant distortions from both technical choices and adversarial manipulation.
Abstract: This paper investigates the fragility of contribution evaluation in federated learning, a critical mechanism for ensuring fairness and incentivizing participation. We argue that contribution scores are susceptible to significant distortions from two fundamental perspectives: architectural sensitivity and intentional manipulation. First, we explore how different model aggregation methods impact these scores. While most research assumes a basic averaging approach, we demonstrate that advanced techniques, including those designed to handle unreliable or diverse clients, can unintentionally yet significantly alter the final scores. Second, we explore vulnerabilities posed by poisoning attacks, where malicious participants strategically manipulate their model updates to inflate their own contribution scores or reduce the importance of other participants. Through extensive experiments across diverse datasets and model architectures, implemented within the Flower framework, we rigorously show that both the choice of aggregation method and the presence of attackers are potent vectors for distorting contribution scores, highlighting a critical need for more robust evaluation schemes.
[392] Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches
Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber
Main category: cs.LG
TL;DR: Benchmarking foundation models (LLMs/VLMs) for zero-shot exploration in RL reveals a knowing-doing gap - VLMs understand objectives but fail at precise control. A hybrid framework shows VLM guidance can improve early-stage sample efficiency.
Details
Motivation: To understand foundation models' capabilities as zero-shot exploration agents in RL, particularly in sparse-reward settings where exploration is challenging.Method: Benchmark LLMs and VLMs on multi-armed bandits, Gridworlds, and sparse-reward Atari games. Investigate a simple on-policy hybrid framework in controlled best-case scenarios.
Result: VLMs can infer high-level objectives from visual input but consistently fail at precise low-level control. In hybrid framework, VLM guidance significantly improves early-stage sample efficiency.
Conclusion: Foundation models are better suited for guiding exploration rather than end-to-end control, providing clear analysis of their potential and constraints in RL settings.
Abstract: Exploration in reinforcement learning (RL) remains challenging, particularly in sparse-reward settings. While foundation models possess strong semantic priors, their capabilities as zero-shot exploration agents in classic RL benchmarks are not well understood. We benchmark LLMs and VLMs on multi-armed bandits, Gridworlds, and sparse-reward Atari to test zero-shot exploration. Our investigation reveals a key limitation: while VLMs can infer high-level objectives from visual input, they consistently fail at precise low-level control: the “knowing-doing gap”. To analyze a potential bridge for this gap, we investigate a simple on-policy hybrid framework in a controlled, best-case scenario. Our results in this idealized setting show that VLM guidance can significantly improve early-stage sample efficiency, providing a clear analysis of the potential and constraints of using foundation models to guide exploration rather than for end-to-end control.
[393] MMSE-Calibrated Few-Shot Prompting for Alzheimer’s Detection
Jana Sweidan, Mounim A. El-Yacoubi, Nasredine Semmar
Main category: cs.LG
TL;DR: This paper presents two prompting methods for detecting Alzheimer’s disease from speech transcripts using large language models, achieving state-of-the-art results on the ADReSS dataset.
Details
Motivation: To develop training-free methods for Alzheimer's disease detection from speech transcripts using prompting techniques with large language models, improving interpretability and performance.Method: Two prompting variants: (1) MMSE-Proxy Prompting - few-shot examples with probabilities mapped to Mini-Mental State Examination bands; (2) Reasoning-augmented Prompting - using multimodal LLM (GPT-5) to generate reasoning-enhanced examples from Cookie Theft image, transcript, and MMSE data.
Result: MMSE-Proxy Prompting achieved 0.82 accuracy and 0.86 AUC; Reasoning-augmented Prompting achieved 0.82 accuracy and 0.83 AUC, both representing state-of-the-art prompting results.
Conclusion: The study demonstrates effective training-free Alzheimer’s disease detection from speech transcripts using advanced prompting techniques, with novel contributions in MMSE-anchored probabilities and multimodal construction for improved interpretability.
Abstract: Prompting large language models is a training-free method for detecting Alzheimer's disease from speech transcripts. Using the ADReSS dataset, we revisit zero-shot prompting and study few-shot prompting with a class-balanced protocol using nested interleave and a strict schema, sweeping up to 20 examples per class. We evaluate two variants that achieve state-of-the-art prompting results. (i) MMSE-Proxy Prompting: each few-shot example carries a probability anchored to Mini-Mental State Examination (MMSE) bands via a deterministic mapping, enabling AUC computation; this reaches 0.82 accuracy and 0.86 AUC. (ii) Reasoning-augmented Prompting: the few-shot example pool is generated with a multimodal LLM (GPT-5) that takes as input the Cookie Theft image, transcript, and MMSE score, and outputs a reasoning trace and an MMSE-aligned probability; evaluation remains transcript-only and reaches 0.82 accuracy and 0.83 AUC. To our knowledge, this is the first ADReSS study to anchor elicited probabilities to the MMSE and to use multimodal construction to improve interpretability.
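The MMSE band-to-probability mapping is described as deterministic but its values are not given here; a hypothetical version, with bands and probabilities invented purely for illustration:

```python
def mmse_band_probability(mmse_score):
    """Map an MMSE band to an anchored P(AD). Bands/values are assumptions."""
    if mmse_score >= 27:      # normal cognition band
        return 0.10
    if mmse_score >= 24:      # borderline band
        return 0.35
    if mmse_score >= 19:      # mild impairment band
        return 0.65
    return 0.90               # moderate-to-severe band
```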
[394] TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees
Emmanouil Panagiotou, Benoît Ronval, Arjun Roy, Ludwig Bothmann, Bernd Bischl, Siegfried Nijssen, Eirini Ntoutsi
Main category: cs.LG
TL;DR: TABFAIRGDT is a novel method for generating fair synthetic tabular data using autoregressive decision trees with soft leaf resampling to mitigate bias while preserving utility.
Details
Motivation: Machine learning models often inherit biases from training data, and while generative models show promise for bias mitigation, existing approaches rely on deep architectures despite simpler models being effective for tabular data.Method: Uses autoregressive decision trees with a soft leaf resampling technique that adjusts decision tree outputs to reduce bias. The approach is non-parametric, handles mixed feature types, and requires no data pre-processing.
Result: Outperforms state-of-the-art deep generative models on benchmark fairness datasets, achieving better fairness-utility trade-off and higher synthetic data quality. Achieves 72% average speedup over fastest baseline and generates fair synthetic data for medium-sized datasets in one second on CPU.
Conclusion: TABFAIRGDT provides an efficient, lightweight, and CPU-compatible solution for generating fair synthetic tabular data, making it ideal for real-world fairness-sensitive applications.
Abstract: Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric, effectively capturing complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving better fairness-utility trade-off for downstream tasks, as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for medium-sized datasets (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.
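The abstract describes the fairness lever as softly adjusting decision-tree leaf outputs; one plausible reading, blending each leaf's per-group label distribution toward the group average (the paper's exact rule may differ):

```python
import numpy as np

def soft_leaf_resample(leaf_probs, strength=0.5):
    """leaf_probs: (n_groups, n_classes) label distribution within one leaf.

    Blending toward the group average weakens the sampled label's dependence
    on the protected attribute while staying close to the fitted distribution.
    """
    balanced = leaf_probs.mean(axis=0, keepdims=True)
    return (1.0 - strength) * leaf_probs + strength * balanced
```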
[395] How deep is your network? Deep vs. shallow learning of transfer operators
Mohammad Tabish, Benedict Leimkuhler, Stefan Klus
Main category: cs.LG
TL;DR: RaNNDy is a randomized neural network approach for learning transfer operators and their spectral decompositions from data, using random hidden layer weights and training only the output layer to achieve fast training with good accuracy.
Details
Motivation: To develop a more efficient method for learning transfer operators that reduces training time and resources while avoiding common deep learning issues like hyperparameter sensitivity and slow convergence.Method: Randomly select weights for hidden layers, train only the output layer to compute a closed-form solution representing eigenfunctions, and use ensemble learning for uncertainty estimation.
Result: The approach significantly reduces training time without noticeable accuracy reduction, successfully applied to various operators including Koopman, Perron-Frobenius, and Schrödinger operators.
Conclusion: RaNNDy provides an efficient framework for spectral analysis of transfer operators with uncertainty quantification, though the paper acknowledges both strengths and weaknesses through numerical examples.
Abstract: We propose a randomized neural network approach called RaNNDy for learning transfer operators and their spectral decompositions from data. The weights of the hidden layers of the neural network are randomly selected and only the output layer is trained. The main advantage is that without a noticeable reduction in accuracy, this approach significantly reduces the training time and resources while avoiding common problems associated with deep learning such as sensitivity to hyperparameters and slow convergence. Additionally, the proposed framework allows us to compute a closed-form solution for the output layer which directly represents the eigenfunctions of the operator. Moreover, it is possible to estimate uncertainties associated with the computed spectral properties via ensemble learning. We present results for different dynamical operators, including Koopman and Perron-Frobenius operators, which have important applications in analyzing the behavior of complex dynamical systems, and the Schrödinger operator. The numerical examples, which highlight the strengths but also weaknesses of the proposed framework, include several stochastic dynamical systems, protein folding processes, and the quantum harmonic oscillator.
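The recipe maps naturally onto EDMD-style operator estimation: draw the hidden layer once, then the trained output layer reduces to a closed-form least-squares solve whose eigenvectors give eigenfunction coefficients. A minimal sketch for snapshot pairs, with the tanh dictionary as an assumed choice:

```python
import numpy as np

def random_feature_koopman(X, Y, n_features=256, seed=0):
    """Approximate the Koopman operator with a fixed random dictionary.

    X, Y: snapshot matrices of shape (n_samples, dim) with Y[i] the
    time-evolved X[i]. Only a least-squares solve is required.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    phi = lambda Z: np.tanh(Z @ W + b)           # frozen random hidden layer
    K, *_ = np.linalg.lstsq(phi(X), phi(Y), rcond=None)
    eigvals, eigvecs = np.linalg.eig(K)          # spectral decomposition
    return eigvals, eigvecs, phi                 # eigenfunctions: phi(x) @ eigvecs
```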
[396] Learnable Sampler Distillation for Discrete Diffusion Models
Feiyang Fu, Tongxian Guo, Zhaoqiang Liu
Main category: cs.LG
TL;DR: Proposes learnable sampler distillation (LSD) to accelerate discrete diffusion models by training student samplers to match teacher trajectories with adaptive coefficients and non-uniform time scheduling.
Details
Motivation: Discrete diffusion models suffer from inefficient sampling requiring many steps, and accelerating them with larger step sizes degrades generation quality due to compounding decoding and discretization errors.
Method: LSD uses distillation where a student sampler learns to align its score trajectory with a high-quality teacher sampler through learnable coefficients. LSD+ adds learnable time schedules for non-uniform step allocation.
Result: Experiments across text generation, image generation, and synthetic tasks show LSD outperforms existing samplers, achieving higher quality with significantly fewer sampling steps.
Conclusion: The proposed LSD approach enables efficient and high-fidelity sampling for discrete diffusion models, making them more practical for real-world applications.
Abstract: Discrete diffusion models (DDMs) have shown powerful generation ability for discrete data modalities like text and molecules. However, their practical application is hindered by inefficient sampling, requiring a large number of sampling steps. Accelerating DDMs by using larger step sizes typically introduces significant problems in generation quality, as it amplifies the impact of both the compounding decoding error due to factorized predictions and discretization error from numerical approximations, leading to a significant decrease in sampling quality. To address these challenges, we propose learnable sampler distillation (LSD), a novel approach to train fast and high-fidelity samplers for DDMs. LSD employs a distillation approach where a student sampler with a few steps learns to align its intermediate score trajectory with that of a high-quality teacher sampler with numerous steps. This alignment is achieved by optimizing learnable sampler coefficients that adaptively adjust sampling dynamics. Additionally, we further propose LSD+, which also learns time schedules that allocate steps non-uniformly. Experiments across text generation, image generation, and synthetic tasks demonstrate that our proposed approaches outperform existing samplers for DDMs, achieving substantially higher sampling quality with significantly fewer sampling steps. Our code is available at https://github.com/feiyangfu/LSD.
[397] From Samples to Scenarios: A New Paradigm for Probabilistic Forecasting
Xilin Dai, Zhijian Xu, Wanxu Cai, Qiang Xu
Main category: cs.LG
TL;DR: The paper introduces Probabilistic Scenarios as an alternative to sampling-based probabilistic forecasting, proposing TimePrism model that achieves state-of-the-art results by directly producing scenario-probability pairs.
Details
Motivation: Current sampling-based probabilistic forecasting models suffer from limitations like lack of explicit probabilities, inadequate coverage, and high computational costs.
Method: Proposes TimePrism - a simple model with three parallel linear layers that directly produces finite sets of {Scenario, Probability} pairs instead of using Monte Carlo sampling.
Result: TimePrism achieves 9 out of 10 state-of-the-art results across five benchmark datasets on two metrics, demonstrating superior performance over sampling-based approaches.
Conclusion: The Probabilistic Scenarios paradigm offers a promising research direction for forecasting by reframing the learning objective from modeling continuous probability spaces to representing plausible scenarios with explicit probabilities.
Abstract: Most state-of-the-art probabilistic time series forecasting models rely on sampling to represent future uncertainty. However, this paradigm suffers from inherent limitations, such as lacking explicit probabilities, inadequate coverage, and high computational costs. In this work, we introduce Probabilistic Scenarios, an alternative paradigm designed to address the limitations of sampling. It operates by directly producing a finite set of {Scenario, Probability} pairs, thus avoiding Monte Carlo-like approximation. To validate this paradigm, we propose TimePrism, a simple model composed of only three parallel linear layers. Surprisingly, TimePrism achieves 9 out of 10 state-of-the-art results across five benchmark datasets on two metrics. The effectiveness of our paradigm comes from a fundamental reframing of the learning objective. Instead of modeling an entire continuous probability space, the model learns to represent a set of plausible scenarios and corresponding probabilities. Our work demonstrates the potential of the Probabilistic Scenarios paradigm, opening a promising research direction in forecasting beyond sampling.
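A minimal stand-in for the scenario-probability output head, to make the paradigm concrete. The layer shapes, the residual base forecast, and the K=8 scenario count are assumptions for illustration, not the published TimePrism layout.

```python
import torch
import torch.nn as nn

class ScenarioHead(nn.Module):
    """Toy scenario-probability forecaster: emit K scenarios with explicit
    probabilities instead of Monte Carlo samples. (Illustrative sketch;
    the real TimePrism architecture may differ in detail.)"""

    def __init__(self, hist_len, horizon, k_scenarios=8):
        super().__init__()
        self.k, self.horizon = k_scenarios, horizon
        self.scen = nn.Linear(hist_len, k_scenarios * horizon)  # scenario offsets
        self.prob = nn.Linear(hist_len, k_scenarios)            # their probabilities
        self.base = nn.Linear(hist_len, horizon)                # shared point forecast

    def forward(self, x):                       # x: (batch, hist_len)
        scenarios = self.base(x).unsqueeze(1) + \
                    self.scen(x).view(-1, self.k, self.horizon)
        probs = torch.softmax(self.prob(x), dim=-1)  # explicit, sums to 1
        return scenarios, probs

model = ScenarioHead(hist_len=96, horizon=24)
scen, p = model(torch.randn(4, 96))   # shapes: (4, 8, 24) and (4, 8)
```

The key design point survives even in this toy form: every forecast comes with an explicit probability, so coverage and likelihoods can be evaluated without drawing samples.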
[398] Faster Than SVD, Smarter Than SGD: The OPLoRA Alternating Update
Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horváth, Martin Takáč
Main category: cs.LG
TL;DR: OPLoRA is a memory-efficient optimizer that improves LoRA fine-tuning by casting it as an interpretable sub-problem and solving it with alternating least squares updates, achieving performance close to full SVD training with much lower memory usage.
Details
Motivation: There is a performance gap between full training with low-rank projections (SVDLoRA) and standard LoRA fine-tuning, indicating that LoRA optimization can be improved to better match full training performance while maintaining memory efficiency.
Method: OPLoRA formulates LoRA optimization as an interpretable sub-problem and solves it efficiently using alternating least squares updates, requiring only 1-2 alternating steps to closely match truncated SVD. It also supports momentum through a low-rank estimate method called LoRSum, with memory usage comparable to Adam.
Result: OPLoRA consistently approaches SVDLoRA’s performance across linear tasks, MNIST, CIFAR-100, and RoBERTa-base (MNLI) while using significantly less memory than full SVD training.
Conclusion: OPLoRA effectively bridges the performance gap between standard LoRA and full SVD training, providing a memory-efficient optimization approach that maintains high performance while being computationally practical.
Abstract: Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. However, there is still a gap between full training with low-rank projections (SVDLoRA) and LoRA fine-tuning, indicating that LoRA steps can be further improved. In this study, we propose OPLoRA, a memory-efficient optimizer that closes this gap by casting LoRA optimization as an interpretable sub-problem and solving it efficiently with alternating least squares updates, where 1-2 alternating steps are empirically found to be sufficient to closely match truncated SVD without ever forming the full matrix. We also retrieve the recently proposed preconditioning methods for LoRA as a special case. OPLoRA supports momentum by maintaining a low-rank estimate using the same subroutine (LoRSum) for computing the step, with a memory budget of 3 times the number of LoRA parameters (i.e., same as Adam). We also propose an experimental scaled variant that uses the K-FAC metric, which could be of interest. Across a linear task, MNIST, CIFAR-100, and RoBERTa-base (MNLI), OPLoRA consistently approaches SVDLoRA’s performance using significantly less memory.
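The alternating least-squares subproblem at the heart of the method can be sketched generically. Each sweep below is one step of subspace iteration toward the rank-r truncated SVD, without ever forming the SVD itself; this is the generic subproblem only, not the exact OPLoRA update rule.

```python
import numpy as np

def als_low_rank(G, r, sweeps=2, rng=None):
    """Alternating least squares for min ||G - B @ A||_F, B: (m,r), A: (r,n).
    Each sweep solves two closed-form least-squares problems; a couple of
    sweeps already tracks the dominant rank-r subspace of G."""
    rng = rng or np.random.default_rng(0)
    A = rng.standard_normal((r, G.shape[1]))
    for _ in range(sweeps):
        B = np.linalg.solve(A @ A.T, A @ G.T).T   # fix A, solve for B
        A = np.linalg.solve(B.T @ B, B.T @ G)     # fix B, solve for A
    return B, A

G = np.random.default_rng(1).standard_normal((256, 128))
B, A = als_low_rank(G, r=8)
U, s, Vt = np.linalg.svd(G)
svd_err = np.linalg.norm(G - (U[:, :8] * s[:8]) @ Vt[:8])  # optimal rank-8 error
als_err = np.linalg.norm(G - B @ A)                        # approaches svd_err
```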
[399] RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Main category: cs.LG
TL;DR: RAD is a retrieval-augmented framework that explicitly injects external medical knowledge into multimodal models for clinical diagnosis, achieving state-of-the-art performance while improving interpretability.
Details
Motivation: Current AI medical approaches rely on implicitly encoded knowledge, neglecting task-specific knowledge needed for diverse downstream diagnostic tasks. The paper aims to address this limitation by explicitly incorporating external medical knowledge.
Method: RAD uses three mechanisms: retrieval and refinement of disease-centered knowledge from multiple sources, guideline-enhanced contrastive loss to constrain feature distances, and dual transformer decoder using guidelines as queries to steer cross-modal fusion.
Result: Extensive evaluations across four datasets with different anatomies demonstrate RAD’s generalizability and state-of-the-art performance. The framework enables more precise focus on abnormal regions and critical indicators.
Conclusion: RAD provides evidence-based, trustworthy diagnosis by aligning models with clinical workflows from guideline acquisition to decision-making, while introducing quantitative criteria for interpretability assessment.
Abstract: Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and the dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD’s generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.
[400] Pi-Transformer: A Physics-informed Attention Mechanism for Time Series Anomaly Detection
Sepehr Maleki, Negar Pourmoazemi
Main category: cs.LG
TL;DR: Pi-Transformer is a physics-informed transformer model for multivariate time series anomaly detection that uses dual attention pathways (data-driven and prior-based) to capture temporal invariants and cross-channel coordination, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: Anomalies in multivariate time series often stem from temporal context and cross-channel coordination rather than isolated outliers, requiring models that can capture these complex patterns while being robust and interpretable.
Method: Uses two attention pathways: data-driven series attention and smoothly evolving prior attention encoding temporal invariants (scale-related self-similarity, phase synchrony). Combines reconstruction objective with divergence term to encourage agreement while maintaining distinction between attentions. Prior is regularized to evolve smoothly and distilled toward dataset-level statistics.
Result: Achieves state-of-the-art or highly competitive F1 scores across five benchmarks (SMD, MSL, SMAP, SWaT, PSM), with particular strength on timing and phase-breaking anomalies. Case analyses show complementary behavior of the two streams and interpretable detections around regime changes.
Conclusion: Embedding physics-informed priors into attention yields a calibrated and robust approach to anomaly detection in complex multivariate systems, providing both high performance and interpretability.
Abstract: Anomalies in multivariate time series often arise from temporal context and cross-channel coordination rather than isolated outliers. We present Pi-Transformer, a physics-informed transformer with two attention pathways: a data-driven series attention and a smoothly evolving prior attention that encodes temporal invariants such as scale-related self-similarity and phase synchrony. The prior acts as a stable reference that calibrates reconstruction error. During training, we pair a reconstruction objective with a divergence term that encourages agreement between the two attentions while keeping them meaningfully distinct; the prior is regularised to evolve smoothly and is lightly distilled towards dataset-level statistics. At inference, the model combines an alignment-weighted reconstruction signal (Energy) with a mismatch signal that highlights timing and phase disruptions, and fuses them into a single score for detection. Across five benchmarks (SMD, MSL, SMAP, SWaT, and PSM), Pi-Transformer achieves state-of-the-art or highly competitive F1, with particular strength on timing and phase-breaking anomalies. Case analyses show complementary behaviour of the two streams and interpretable detections around regime changes. Embedding physics-informed priors into attention yields a calibrated and robust approach to anomaly detection in complex multivariate systems. Code is publicly available at https://github.com/sepehr-m/Pi-Transformer.
[401] Learning Robust Penetration-Testing Policies under Partial Observability: A systematic evaluation
Raphael Simon, Pieter Libin, Wim Mees
Main category: cs.LG
TL;DR: This paper investigates reinforcement learning approaches for penetration testing in partially observable environments, comparing PPO variants with history aggregation techniques to address the challenges of partial observability in cybersecurity simulations.
Details
Motivation: Penetration testing presents a sequential decision-making problem suitable for RL automation, but partial observability invalidates the Markov property, requiring history aggregation or belief state estimation to learn successful policies in real-world cybersecurity scenarios.
Method: The researchers use vanilla PPO as a baseline and compare various PPO variants designed to mitigate partial observability, including frame-stacking, augmenting observations with historical information, and employing recurrent or transformer-based architectures. They conduct systematic empirical analysis across different host network sizes.
Result: The study finds that penetration testing tasks greatly benefit from history aggregation, with these approaches converging three times faster than other methods. Manual inspection of learned policies reveals clear distinctions and provides insights beyond quantitative results.
Conclusion: History aggregation techniques are crucial for developing robust and transferable penetration testing policies that can perform reliably across diverse and unpredictable real-world cybersecurity environments.
Abstract: Penetration testing, the simulation of cyberattacks to identify security vulnerabilities, presents a sequential decision-making problem well-suited for reinforcement learning (RL) automation. Like many applications of RL to real-world problems, partial observability presents a major challenge, as it invalidates the Markov property present in Markov Decision Processes (MDPs). Partially Observable MDPs require history aggregation or belief state estimation to learn successful policies. We investigate stochastic, partially observable penetration testing scenarios over host networks of varying size, aiming to better reflect real-world complexity through more challenging and representative benchmarks. This approach leads to the development of more robust and transferable policies, which are crucial for ensuring reliable performance across diverse and unpredictable real-world environments. Using vanilla Proximal Policy Optimization (PPO) as a baseline, we compare a selection of PPO variants designed to mitigate partial observability, including frame-stacking, augmenting observations with historical information, and employing recurrent or transformer-based architectures. We conduct a systematic empirical analysis of these algorithms across different host network sizes. We find that this task greatly benefits from history aggregation, converging three times faster than other approaches. Manual inspection of the learned policies by the algorithms reveals clear distinctions and provides insights that go beyond quantitative results.
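Frame-stacking, the simplest of the history-aggregation variants compared above, amounts to a thin observation wrapper around the environment. This is generic RL plumbing (a classic gym-style reset/step API is assumed), not the paper's environment code.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Minimal history-aggregation wrapper: the agent sees the last k
    observations concatenated, restoring approximate Markovianity in an
    otherwise partially observable environment."""

    def __init__(self, env, k=4):
        self.env, self.k = env, k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):          # seed history with the first frame
            self.frames.append(obs)
        return np.concatenate(self.frames)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)          # oldest frame drops off automatically
        return np.concatenate(self.frames), reward, done, info
```

A PPO agent then trains on the stacked observation vector exactly as it would on a fully observable state, which is why this baseline is so cheap compared with recurrent or transformer policies.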
[402] Diffusion-Augmented Contrastive Learning: A Noise-Robust Encoder for Biosignal Representations
Rami Zewail
Main category: cs.LG
TL;DR: DACL is a hybrid framework combining diffusion models and supervised contrastive learning to create robust biosignal representations, achieving 0.7815 AUROC on ECG data.
Details
Motivation: Traditional data augmentation methods fail to capture complex variations in physiological data, necessitating more effective representation learning approaches for biosignals.
Method: Uses a VAE on Scattering Transformer features to create latent space, applies diffusion forward process for data augmentation, and trains a U-Net encoder with supervised contrastive learning to balance class discrimination and noise robustness.
Result: Achieved competitive AUROC of 0.7815 on PhysioNet 2017 ECG dataset.
Conclusion: Establishes a new paradigm using diffusion process to drive contrastive learning, creating noise-invariant embeddings with strong class separability foundation.
Abstract: Learning robust representations for biosignals is often hampered by the challenge of designing effective data augmentations. Traditional methods can fail to capture the complex variations inherent in physiological data. Within this context, we propose a novel hybrid framework, Diffusion-Augmented Contrastive Learning (DACL), that fuses concepts from diffusion models and supervised contrastive learning. The DACL framework operates on a latent space created by a lightweight Variational Autoencoder (VAE) trained on our novel Scattering Transformer (ST) features [12]. It utilizes the diffusion forward process as a principled data augmentation technique to generate multiple noisy views of these latent embeddings. A U-Net style encoder is then trained with a supervised contrastive objective to learn a representation that balances class discrimination with robustness to noise across various diffusion time steps. We evaluated this proof-of-concept method on the PhysioNet 2017 ECG dataset, achieving a competitive AUROC of 0.7815. This work establishes a new paradigm for representation learning by using the diffusion process itself to drive the contrastive objective, creating noise-invariant embeddings that demonstrate a strong foundation for class separability.
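The forward-process-as-augmentation step uses the standard DDPM closed form q(z_t | z_0) = N(sqrt(abar_t) z_0, (1 - abar_t) I). The sketch below shows that step in isolation; the linear beta schedule and timestep choices are illustrative, not the paper's settings.

```python
import torch

def diffusion_views(z0, alpha_bar, t_list):
    """Use the diffusion forward process as augmentation: each timestep t
    yields a progressively noisier 'view' of the latent embedding z0,
    which a (supervised) contrastive loss then consumes."""
    views = []
    for t in t_list:
        a = alpha_bar[t]
        eps = torch.randn_like(z0)
        views.append(torch.sqrt(a) * z0 + torch.sqrt(1 - a) * eps)
    return views

T = 100
betas = torch.linspace(1e-4, 0.02, T)          # illustrative schedule
alpha_bar = torch.cumprod(1 - betas, dim=0)
z0 = torch.randn(32, 64)                       # batch of VAE latents
views = diffusion_views(z0, alpha_bar, t_list=[10, 40, 80])
```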
[403] One Filters All: A Generalist Filter for State Estimation
Shiqi Liu, Wenhan Cao, Chang Liu, Zeyu He, Tianyi Zhang, Shengbo Eben Li
Main category: cs.LG
TL;DR: LLM-Filter is a novel filtering framework that uses large language models for state estimation in dynamical systems by embedding observations with text prototypes.
Details
Motivation: To leverage the reasoning knowledge embedded in pre-trained LLMs for improved state estimation in dynamical systems, overcoming limitations of traditional filtering approaches.
Method: Uses System-as-Prompt (SaP) structure to embed noisy observations with text prototypes, enabling LLMs to understand estimation tasks through proper modality alignment with frozen LLM.
Result: Outperforms state-of-the-art learning-based approaches, shows exceptional generalization to changed/unseen environments, and exhibits scaling-law behavior with accuracy improving with larger models and longer training.
Conclusion: LLM-Filter demonstrates promising potential as a foundation model for filtering tasks, benefiting from LLMs’ reasoning capabilities and generalization properties.
Abstract: Estimating hidden states in dynamical systems, also known as optimal filtering, is a long-standing problem in various fields of science and engineering. In this paper, we introduce a general filtering framework, LLM-Filter, which leverages large language models (LLMs) for state estimation by embedding noisy observations with text prototypes. In various experiments for classical dynamical systems, we find that first, state estimation can significantly benefit from the reasoning knowledge embedded in pre-trained LLMs. By achieving proper modality alignment with the frozen LLM, LLM-Filter outperforms the state-of-the-art learning-based approaches. Second, we carefully design the prompt structure, System-as-Prompt (SaP), incorporating task instructions that enable the LLM to understand the estimation tasks. Guided by these prompts, LLM-Filter exhibits exceptional generalization, capable of performing filtering tasks accurately in changed or even unseen environments. We further observe a scaling-law behavior in LLM-Filter, where accuracy improves with larger model sizes and longer training times. These findings make LLM-Filter a promising foundation model for filtering.
[404] You Only Measure Once: On Designing Single-Shot Quantum Machine Learning Models
Chen-Yu Liu, Leonardo Placidi, Kuan-Cheng Chen, Samuel Yen-Chi Chen, Gabriel Matos
Main category: cs.LG
TL;DR: Yomo (You Only Measure Once) is a quantum machine learning design that enables accurate inference with dramatically fewer measurements, down to single-shot regime, by replacing Pauli expectation-value outputs with probability aggregation and using sharp prediction loss functions.
Details
Motivation: Current QML models require repeated measurements (shots) for reliable predictions, leading to high inference costs and time overhead, which is problematic since quantum hardware access is typically priced proportionally to shot count.
Method: Yomo replaces conventional Pauli expectation-value outputs with a probability aggregation mechanism and introduces loss functions that encourage sharp predictions, avoiding shot-scaling limitations of expectation-based models.
Result: Experiments on MNIST and CIFAR-10 show Yomo consistently outperforms baselines across different shot budgets and under simulations with depolarizing channels, achieving accurate single-shot inference.
Conclusion: Yomo substantially reduces the financial and computational costs of deploying QML by enabling accurate single-shot inference, thereby lowering the barrier to practical adoption of quantum machine learning.
Abstract: Quantum machine learning (QML) models conventionally rely on repeated measurements (shots) of observables to obtain reliable predictions. This dependence on large shot budgets leads to high inference cost and time overhead, which is particularly problematic as quantum hardware access is typically priced proportionally to the number of shots. In this work we propose You Only Measure Once (Yomo), a simple yet effective design that achieves accurate inference with dramatically fewer measurements, down to the single-shot regime. Yomo replaces Pauli expectation-value outputs with a probability aggregation mechanism and introduces loss functions that encourage sharp predictions. Our theoretical analysis shows that Yomo avoids the shot-scaling limitations inherent to expectation-based models, and our experiments on MNIST and CIFAR-10 confirm that Yomo consistently outperforms baselines across different shot budgets and under simulations with depolarizing channels. By enabling accurate single-shot inference, Yomo substantially reduces the financial and computational costs of deploying QML, thereby lowering the barrier to practical adoption of QML.
[405] Incomplete Data, Complete Dynamics: A Diffusion Approach
Zihan Zhou, Chenguang Wang, Hongyi Ye, Yongtao Guan, Tianshu Yu
Main category: cs.LG
TL;DR: A diffusion-based framework for learning physical systems from incomplete, irregularly sampled data by partitioning samples into observed context and unobserved query components, then training a conditional diffusion model to reconstruct missing portions.
Details
Motivation: Real-world observational data are inherently incomplete and irregularly sampled, posing challenges for existing data-driven approaches to learning physical dynamics.
Method: Strategic partitioning of samples into observed context and unobserved query components, followed by training a conditional diffusion model to reconstruct missing query portions given available contexts without requiring complete data supervision.
Result: Theoretical analysis shows asymptotic convergence to true generative process; empirical results demonstrate significant outperformance over baselines on fluid flows and weather systems, especially in limited/irregular observation regimes.
Conclusion: The proposed diffusion-based framework provides an effective, theoretically principled approach for learning and imputing partially observed physical dynamics.
Abstract: Learning physical dynamics from data is a fundamental challenge in machine learning and scientific modeling. Real-world observational data are inherently incomplete and irregularly sampled, posing significant challenges for existing data-driven approaches. In this work, we propose a principled diffusion-based framework for learning physical systems from incomplete training samples. To this end, our method strategically partitions each such sample into observed context and unobserved query components through a carefully designed splitting strategy, then trains a conditional diffusion model to reconstruct the missing query portions given available contexts. This formulation enables accurate imputation across arbitrary observation patterns without requiring complete data supervision. Specifically, we provide theoretical analysis demonstrating that our diffusion training paradigm on incomplete data achieves asymptotic convergence to the true complete generative process under mild regularity conditions. Empirically, we show that our method significantly outperforms existing baselines on synthetic and real-world physical dynamics benchmarks, including fluid flows and weather systems, with particularly strong performance in limited and irregular observation regimes. These results demonstrate the effectiveness of our theoretically principled approach for learning and imputing partially observed dynamics.
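A minimal sketch of the context/query partitioning on a single incomplete sample. The paper describes its splitting strategy as carefully designed; the uniform random choice and the query_frac parameter below are placeholders for illustration.

```python
import numpy as np

def split_context_query(observed_mask, query_frac=0.3, rng=None):
    """Partition the *observed* entries of an incomplete sample into a
    context part (conditioning input) and a query part (reconstruction
    target), so training never requires fully complete data."""
    rng = rng or np.random.default_rng()
    obs_idx = np.flatnonzero(observed_mask)
    q = rng.choice(obs_idx, size=max(1, int(query_frac * len(obs_idx))),
                   replace=False)
    query_mask = np.zeros_like(observed_mask)
    query_mask[q] = True
    context_mask = observed_mask & ~query_mask
    return context_mask, query_mask   # train: p(x[query] | x[context])

x = np.random.randn(128)                  # one irregularly sampled snapshot
observed = np.random.rand(128) < 0.6      # only 60% of entries were measured
ctx, qry = split_context_query(observed)
```

The conditional diffusion model is then trained to denoise the query entries given the context entries, which at inference time doubles as an imputation engine for arbitrary observation patterns.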
[406] Discovering Association Rules in High-Dimensional Small Tabular Data
Erkan Karabulut, Daniel Daza, Paul Groth, Victoria Degeler
Main category: cs.LG
TL;DR: This paper addresses Association Rule Mining (ARM) challenges in high-dimensional data, particularly rule explosion and poor performance in low-data settings. It improves upon Aerial+ with fine-tuning approaches using tabular foundation models.
Details
Motivation: Traditional ARM methods face computational challenges in high-dimensional settings, while neurosymbolic methods like Aerial+ inherit neural network limitations in low-data regimes. The paper aims to solve ARM problems in high-dimensional, low-data scenarios common in domains like biomedicine.
Method: The authors propose two fine-tuning approaches to Aerial+ using tabular foundation models to enhance rule discovery in high-dimensional, low-data settings. They empirically evaluate scalability and rule quality across five real-world datasets.
Result: Aerial+ scales one to two orders of magnitude better than state-of-the-art baselines. The proposed fine-tuning approaches significantly improve rule quality in low-data, high-dimensional scenarios.
Conclusion: The paper successfully addresses ARM challenges in high-dimensional tabular data, particularly in low-data regimes, demonstrating effective solutions through neurosymbolic methods enhanced with foundation model fine-tuning.
Abstract: Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.
[407] Beyond Slater’s Condition in Online CMDPs with Stochastic and Adversarial Constraints
Francesco Emanuele Stradi, Eleonora Fidelia Chiefari, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Main category: cs.LG
TL;DR: Novel algorithm for online episodic Constrained Markov Decision Processes (CMDPs) that improves state-of-the-art results in both stochastic and adversarial constraint settings, achieving sublinear regret and constraint violation without requiring Slater’s condition.
Details
Motivation: To address limitations in existing CMDP algorithms, particularly the need for Slater’s condition and inability to handle settings where no strictly feasible solution exists, while providing stronger guarantees on positive constraint violation.
Method: Developed a new algorithm for online episodic CMDPs that works under both stochastic (fixed unknown distributions) and adversarial (arbitrarily changing) constraints. The method achieves sublinear regret and constraint violation without relying on Slater’s condition.
Result: In stochastic regime: Õ(√T) regret and constraint violation without Slater’s condition, with guarantees on positive constraint violation. In adversarial regime: sublinear constraint violation without Slater’s condition, and sublinear α-regret with respect to unconstrained optimum.
Conclusion: The proposed algorithm significantly improves upon state-of-the-art methods, handles more challenging constraint settings, and demonstrates practical effectiveness through synthetic experiments.
Abstract: We study online episodic Constrained Markov Decision Processes (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, i.e., when the constraints are sampled from fixed but unknown distributions, our method achieves Õ(√T) regret and constraint violation without relying on Slater’s condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of positive constraint violation, which does not allow recovering from large violations in the early episodes by playing strictly safe policies. In the adversarial regime, i.e., when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater’s condition, and achieves sublinear α-regret with respect to the unconstrained optimum, where α is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
[408] Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models
Junjie Yao, Zhi-Qin John Xu
Main category: cs.LG
TL;DR: This paper investigates how data distribution shapes embedding structures in language models through probability signatures that reflect semantic relationships.
Details
Motivation: To understand the mechanisms driving the formation of ordered embedding structures in language models, particularly why embeddings of related concepts (like digits) exhibit semantic organization.
Method: Proposed probability signatures to capture semantic relationships, conducted experiments on composite addition tasks using linear models and feedforward networks, analyzed gradient flow dynamics theoretically, and extended analysis to Qwen2.5 LLMs trained on Pile corpus subsets.
Result: Probability signatures significantly influence embedding structures and are faithfully aligned with them, particularly in capturing strong pairwise similarities among embeddings.
Conclusion: The work uncovers how data distribution guides embedding structure formation, establishing a novel understanding of the relationship between embedding organization and semantic patterns.
Abstract: The embedding space of language models is widely believed to capture the semantic relationships; for instance, embeddings of digits often exhibit an ordered structure that corresponds to their natural sequence. However, the mechanisms driving the formation of such structures remain poorly understood. In this work, we interpret the embedding structures via the data distribution. We propose a set of probability signatures that reflect the semantic relationships among tokens. Through experiments on the composite addition tasks using the linear model and feedforward network, combined with theoretical analysis of gradient flow dynamics, we reveal that these probability signatures significantly influence the embedding structures. We further generalize our analysis to large language models (LLMs) by training the Qwen2.5 architecture on the subsets of the Pile corpus. Our results show that the probability signatures are faithfully aligned with the embedding structures, particularly in capturing strong pairwise similarities among embeddings. Our work uncovers the mechanism of how data distribution guides the formation of embedding structures, establishing a novel understanding of the relationship between embedding organization and semantic patterns.
[409] Generative Model Inversion Through the Lens of the Manifold Hypothesis
Xiong Peng, Bo Han, Fengfei Yu, Tongliang Liu, Feng Liu, Mingyuan Zhou
Main category: cs.LG
TL;DR: The paper analyzes why generative model inversion attacks (MIAs) are effective, finding that they work by projecting noisy gradients onto the generator manifold’s tangent space. The authors hypothesize that models become more vulnerable when loss gradients align with the generator manifold, and validate this with a novel training objective and training-free approach.
Details
Motivation: To understand why generative model inversion attacks are so effective at reconstructing private training data from trained models, and to explore the relationship between loss gradient alignment with the generator manifold and model vulnerability to MIAs.
Method: The authors analyze gradients of inversion loss, examine their projection onto generator manifold tangent spaces, design a novel training objective to promote gradient-manifold alignment, and introduce a training-free approach to enhance this alignment during inversion.
Result: Empirical measurements show loss gradients in standard supervised models have large angular deviations from data manifolds. The proposed methods validate that improved gradient-manifold alignment increases MIA effectiveness and leads to consistent improvements over state-of-the-art generative MIAs.
Conclusion: Model vulnerability to inversion attacks is directly related to how well loss gradients align with the generator manifold, and both training-based and training-free methods can exploit this alignment to enhance attack effectiveness.
Abstract: Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process, yielding reconstructions with high visual quality and strong fidelity to the private training data. To explore the reason behind their effectiveness, we begin by examining the gradients of inversion loss with respect to synthetic inputs, and find that these gradients are surprisingly noisy. Further analysis reveals that generative inversion implicitly denoises these gradients by projecting them onto the tangent space of the generator manifold, filtering out off-manifold components while preserving informative directions aligned with the manifold. Our empirical measurements show that, in models trained with standard supervision, loss gradients often exhibit large angular deviations from the data manifold, indicating poor alignment with class-relevant directions. This observation motivates our central hypothesis: models become more vulnerable to MIAs when their loss gradients align more closely with the generator manifold. We validate this hypothesis by designing a novel training objective that explicitly promotes such alignment. Building on this insight, we further introduce a training-free approach to enhance gradient-manifold alignment during inversion, leading to consistent improvements over state-of-the-art generative MIAs.
[410] An Improved Time Series Anomaly Detection by Applying Structural Similarity
Tiejun Wang, Rui Wang, Xudong Mou, Mengyuan Ma, Tianyu Wo, Renyu Yang, Xudong Liu
Main category: cs.LG
TL;DR: StrAD is a novel structure-enhanced anomaly detection approach that incorporates structural information into reconstruction-based methods to better detect both point-wise and pattern-wise anomalies in time series data.
Details
Motivation: Current reconstruction-based anomaly detection methods only use point-by-point distance measures, ignoring structural characteristics of time series and failing to detect complex pattern-wise anomalies. The scarcity of anomaly labels and high labeling costs make unsupervised approaches necessary.
Method: StrAD enriches the optimization objective by incorporating structural information (trend, seasonality, and shape) into the reconstruction process. It ensures alignment between original and reconstructed data in terms of structural features through a pluggable structure-aware optimization mechanism.
Result: Experimental results show that StrAD improves the performance of state-of-the-art reconstruction-based models across five real-world anomaly detection datasets.
Conclusion: The proposed structure-aware optimization mechanism effectively enhances model sensitivity to both point-wise and pattern-wise anomalies, making it a valuable pluggable component for any reconstruction-based anomaly detection method.
Abstract: Effective anomaly detection in time series is pivotal for modern industrial applications and financial systems. Due to the scarcity of anomaly labels and the high cost of manual labeling, reconstruction-based unsupervised approaches have garnered considerable attention. However, accurate anomaly detection remains an open challenge, since the optimization objectives of reconstruction-based methods merely rely on point-by-point distance measures, ignoring the potential structural characteristics of time series and thus failing to tackle complex pattern-wise anomalies. In this paper, we propose StrAD, a novel structure-enhanced anomaly detection approach to enrich the optimization objective by incorporating structural information hidden in the time series and steering the data reconstruction procedure to better capture such structural features. StrAD accommodates the trend, seasonality, and shape in the optimization objective of the reconstruction model to learn latent structural characteristics and capture the intrinsic pattern variation of time series. The proposed structure-aware optimization objective mechanism can assure the alignment between the original data and the reconstructed data in terms of structural features, thereby keeping consistency in global fluctuation and local characteristics. The mechanism is pluggable and applicable to any reconstruction-based method, enhancing the model sensitivity to both point-wise anomalies and pattern-wise anomalies. Experimental results show that StrAD improves the performance of state-of-the-art reconstruction-based models across five real-world anomaly detection datasets.
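To make the pluggable objective concrete, here is a hedged sketch of a structure-aware reconstruction loss. The moving-average trend, FFT-amplitude seasonality, and first-difference shape terms (and the weights w and window k) are illustrative stand-ins for StrAD's actual extractors.

```python
import torch
import torch.nn.functional as F

def structural_loss(x, x_hat, w=(1.0, 0.1, 0.1, 0.1), k=25):
    """Point-wise reconstruction loss augmented with structural terms:
    trend (moving average), seasonality (FFT amplitudes), and shape
    (first differences). x, x_hat: (batch, time)."""
    point = F.mse_loss(x_hat, x)
    ma = lambda s: F.avg_pool1d(s.unsqueeze(1), k, stride=1).squeeze(1)
    trend = F.mse_loss(ma(x_hat), ma(x))
    season = F.mse_loss(torch.fft.rfft(x_hat).abs(), torch.fft.rfft(x).abs())
    shape = F.mse_loss(torch.diff(x_hat, dim=-1), torch.diff(x, dim=-1))
    return w[0]*point + w[1]*trend + w[2]*season + w[3]*shape

loss = structural_loss(torch.randn(8, 200), torch.randn(8, 200))
```

Because the extra terms only touch the loss, any reconstruction-based detector can adopt them without architectural changes, which is the sense in which the mechanism is pluggable.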
[411] FairEquityFL – A Fair and Equitable Client Selection in Federated Learning for Heterogeneous IoV Networks
Fahmida Islam, Adnan Mahmood, Noorain Mukhtiar, Kasun Eranda Wijethilake, Quan Z. Sheng
Main category: cs.LG
TL;DR: FairEquityFL is a federated learning framework that ensures fairness in client selection for Internet of Vehicles applications, incorporating a sampling equalizer and outlier detection mechanism.
Details
Motivation: Existing FL frameworks lack fairness considerations in client selection for dynamic and heterogeneous IoV environments, where only a subset of clients can participate in each training round.
Method: The framework introduces a sampling equalizer module within the selector component to ensure equitable participation opportunities. It also includes an outlier detection mechanism to identify malicious clients based on model performance fluctuations.
Result: Evaluation on FEMNIST dataset shows that FairEquityFL significantly outperforms baseline models in terms of performance.
Conclusion: FairEquityFL successfully addresses fairness challenges in FL client selection for IoV environments while maintaining security through malicious client detection.
Abstract: Federated Learning (FL) has been extensively employed for a number of applications in machine learning, primarily owing to its privacy-preserving nature and efficiency in mitigating the communication overhead. Internet of Vehicles (IoV) is one of the promising applications, wherein FL can be utilized to train a model more efficiently. Since only a subset of the clients can participate in each FL training round, challenges arise pertinent to fairness in the client selection process. Over the years, a number of researchers from both academia and industry have proposed numerous FL frameworks. However, to the best of our knowledge, none of them have employed fairness for FL-based client selection in a dynamic and heterogeneous IoV environment. Accordingly, in this paper, we envisage a FairEquityFL framework to ensure an equitable opportunity for all the clients to participate in the FL training process. In particular, we have introduced a sampling equalizer module within the selector component for ensuring fairness in terms of fair collaboration opportunity for all the clients in the client selection process. The selector is additionally responsible for both monitoring and controlling the clients’ participation in each FL training round. Moreover, an outlier detection mechanism is enforced for identifying malicious clients based on the model performance in terms of considerable fluctuation in either accuracy or loss minimization. The selector flags suspicious clients and temporarily suspends such clients from participating in the FL training process. We further evaluate the performance of FairEquityFL on a publicly available dataset, FEMNIST. Our simulation results show that FairEquityFL outperforms baseline models to a considerable extent.
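In its simplest form, a sampling equalizer reduces to weighting client selection inversely by past participation. The weighting rule below is an illustrative reduction; FairEquityFL's selector additionally monitors accuracy/loss fluctuations and suspends flagged outliers.

```python
import numpy as np

def equalized_selection(participation_counts, m, rng=None):
    """Pick m clients for this round with probability inversely related
    to how often each client has participated so far, equalizing
    collaboration opportunity over time."""
    rng = rng or np.random.default_rng()
    weights = 1.0 / (1.0 + participation_counts)
    probs = weights / weights.sum()
    chosen = rng.choice(len(probs), size=m, replace=False, p=probs)
    participation_counts[chosen] += 1      # selector updates its ledger
    return chosen

counts = np.zeros(100)           # 100 vehicles/clients
for _ in range(50):              # 50 FL rounds, 10 clients per round
    equalized_selection(counts, m=10)
# counts ends up near-uniform (about 5 participations per client)
```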
[412] Staying on the Manifold: Geometry-Aware Noise Injection
Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, Georgios Arvanitidis
Main category: cs.LG
TL;DR: The paper proposes geometry-aware input noise methods that account for the underlying manifold structure of data, improving generalization and robustness compared to traditional ambient noise approaches.
Details
Motivation: Previous research on input perturbation during training only considered ambient noise without accounting for the data’s manifold structure, which limits effectiveness on complex datasets with curved manifolds.
Method: Proposed methods include: 1) Projecting ambient Gaussian noise onto tangent spaces and mapping to manifolds via geodesics, 2) Brownian motion noise moving along manifolds, 3) Extension to learned data manifolds.
Result: Geometry-aware noise improves generalization and robustness to hyperparameter selection on highly curved manifolds, while performing at least as well as no-noise training on simpler manifolds.
Conclusion: Accounting for the underlying manifold structure when adding input noise provides superior regularization compared to traditional ambient noise approaches, particularly for complex datasets.
Abstract: It has been shown that perturbing the input during training implicitly regularises the gradient of the learnt function, leading to smoother models and enhancing generalisation. However, previous research mostly considered the addition of ambient noise in the input space, without considering the underlying structure of the data. In this work, we propose several methods of adding geometry-aware input noise that accounts for the lower-dimensional manifold the input data inhabits. We start by projecting ambient Gaussian noise onto the tangent space of the manifold. In a second step, the noise sample is mapped onto the manifold via the associated geodesic curve. We also consider Brownian motion noise, which moves in random steps along the manifold. We show that geometry-aware noise leads to improved generalization and robustness to hyperparameter selection on highly curved manifolds, while performing at least as well as training without noise on simpler manifolds. Our proposed framework extends to learned data manifolds.
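The first step of the method, projecting ambient Gaussian noise onto the tangent space, is a one-liner once a tangent basis is available; the subsequent geodesic (exponential-map) step is omitted here. How the basis is obtained (local PCA, a learned chart, an analytic formula) is left open in this sketch.

```python
import numpy as np

def tangent_noise(x, tangent_basis, sigma=0.1, rng=None):
    """Project ambient Gaussian noise onto the manifold's tangent space at x.
    tangent_basis: (D, k) matrix whose columns span the tangent space
    (D = ambient dimension, k = manifold dimension)."""
    rng = rng or np.random.default_rng()
    U, _ = np.linalg.qr(tangent_basis)        # orthonormalize the basis
    eps = sigma * rng.standard_normal(x.shape[-1])
    return x + U @ (U.T @ eps)                # keep only the tangential part

# Example: unit circle in R^2; the tangent at x=(1,0) is spanned by (0,1),
# so the perturbation never pushes the point radially off the circle
# (to first order; the geodesic map would remove the residual error).
x = np.array([1.0, 0.0])
x_noisy = tangent_noise(x, tangent_basis=np.array([[0.0], [1.0]]))
```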
[413] Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Deokjae Lee, Hyun Oh Song
Main category: cs.LG
TL;DR: Q-Palette introduces a versatile collection of fractional-bit quantizers and a mixed-scheme quantization framework for weight-only post-training quantization of large language models, achieving near-optimal performance with efficient implementation.
Details
Motivation: Weight-only PTQ is crucial for reducing memory footprint and latency of LLM inference, especially in memory-bound scenarios like edge devices. Irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, motivating methods that transform weights into near-Gaussian distributions.
Method: The paper first derives the information-theoretically optimal bit allocation for Gaussianized weights, then introduces Q-Palette - a collection of fractional-bit quantizers including trellis-coded, vector, and scalar quantizers with optimized CUDA kernels. It also proposes a mixed-scheme quantization framework that jointly optimizes quantizer choices and layer fusion decisions.
Result: The method achieves near-optimal quantization performance by using fine-grained fractional-bit quantizers that approach the Gaussian distortion-rate bound, with efficient implementation across various bitwidths.
Conclusion: Q-Palette provides a practical solution for weight-only PTQ that bridges theoretical insights with practical implementation, enabling efficient LLM inference on resource-constrained devices.
Abstract: We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.
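The distortion-rate math behind the bit-allocation argument is classical: for independent Gaussian groups with variances sigma_i^2 under an average-bit budget R_avg, the optimal allocation is R_i = R_avg + 0.5 * log2(sigma_i^2 / GM(sigma^2)), where GM is the geometric mean. The result is generally fractional, which is the case for fractional-bit quantizers. The sketch below shows only this textbook formula (ignoring clamping at zero bits); Q-Palette's actual allocator also handles kernel and latency constraints.

```python
import numpy as np

def gaussian_bit_allocation(variances, avg_bits):
    """Closed-form rate allocation for independent Gaussian groups under a
    mean-bit budget: higher-variance groups receive more bits,
    R_i = R_avg + 0.5 * log2(var_i / geometric_mean(var))."""
    log_var = np.log2(variances)
    return avg_bits + 0.5 * (log_var - log_var.mean())

bits = gaussian_bit_allocation(np.array([0.5, 1.0, 4.0, 16.0]), avg_bits=3.0)
# -> [1.875, 2.375, 3.375, 4.375]: non-integer, hence fractional-bit codes
```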
[414] Practical do-Shapley Explanations with Estimand-Agnostic Causal Inference
Álvaro Parafita, Tomas Garriga, Axel Brando, Francisco J. Cazorla
Main category: cs.LG
TL;DR: The paper proposes estimand-agnostic approaches to make do-SHAP feasible on complex graphs by enabling estimation of any identifiable query from a single model, along with computational acceleration and methods for explaining inaccessible Data Generating Processes.
Details
Motivation: SHAP is popular but ignores causal structure, while do-SHAP addresses this but is hindered by its reliance on estimands, making it impractical for complex applications.
Method: Developed estimand-agnostic approaches that allow estimation of any identifiable query from one model, a novel algorithm for computational acceleration, and methods to explain inaccessible Data Generating Processes.
Result: Demonstrated successful estimation and computational performance, validated on two real-world datasets showing reliable explanations.
Conclusion: The proposed approach makes do-SHAP practical for complex graphs and provides reliable causal explanations while overcoming previous limitations.
Abstract: Among explainability techniques, SHAP stands out as one of the most popular, but often overlooks the causal structure of the problem. In response, do-SHAP employs interventional queries, but its reliance on estimands hinders its practical application. To address this problem, we propose the use of estimand-agnostic approaches, which allow for the estimation of any identifiable query from a single model, making do-SHAP feasible on complex graphs. We also develop a novel algorithm to significantly accelerate its computation at a negligible cost, as well as a method to explain inaccessible Data Generating Processes. We demonstrate the estimation and computational performance of our approach, and validate it on two real-world datasets, highlighting its potential in obtaining reliable explanations.
[415] Time-adaptive HénonNets for separable Hamiltonian systems
Konrad Janik, Peter Benner
Main category: cs.LG
TL;DR: This paper introduces T-HénonNets, a novel neural network architecture that extends HénonNets to handle adaptive time steps for learning time-adaptive symplectic integrators, particularly for irregularly sampled Hamiltonian systems.
Details
Motivation: Existing machine learning methods for learning symplectic integrators (like SympNets and HénonNets) require training data with fixed step sizes, but real-world measurement data is often sampled irregularly on non-equidistant time grids. This limitation motivates the development of methods that can handle adaptive time steps.
Method: The authors propose T-HénonNets, a symplectic neural network architecture that extends HénonNets to handle adaptive time steps. They also extend this architecture to non-autonomous Hamiltonian systems and provide universal approximation theorems for separable Hamiltonian systems.
Result: The paper presents theoretical approximation capabilities for the proposed architectures and performs numerical experiments to investigate these capabilities, though it acknowledges difficulties in handling non-separable Hamiltonian systems.
Conclusion: T-HénonNets successfully extend the capabilities of HénonNets to handle adaptive time steps, providing a solution for learning symplectic integrators from irregularly sampled data, with proven approximation guarantees for separable Hamiltonian systems.
Abstract: Measurement data is often sampled irregularly, i.e., not on equidistant time grids. This is also true for Hamiltonian systems. However, existing machine learning methods that learn symplectic integrators, such as SympNets [1] and HénonNets [2], still require training data generated with fixed step sizes. To learn time-adaptive symplectic integrators, an extension to SympNets called TSympNets is introduced in [3]. The aim of this work is to provide a similar extension for HénonNets. We propose a novel neural network architecture called T-HénonNets, which is symplectic by design and can handle adaptive time steps. We also extend the T-HénonNet architecture to non-autonomous Hamiltonian systems. Additionally, we provide universal approximation theorems for both new architectures for separable Hamiltonian systems and discuss why it is difficult to handle non-separable Hamiltonian systems with the proposed methods. To investigate these theoretical approximation capabilities, we perform different numerical experiments.
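For background, the property the architecture builds in by design is classical: for a separable Hamiltonian H(q, p) = T(p) + V(q), kick-drift compositions such as symplectic Euler are symplectic for any step size, including irregular ones. The sketch below illustrates that background fact on a harmonic oscillator; it is not the T-HénonNet layer itself.

```python
import numpy as np

def symplectic_euler(q, p, h, grad_V, grad_T):
    """One symplectic Euler step for a separable Hamiltonian
    H(q, p) = T(p) + V(q); each step is symplectic for any h,
    which is the structure a time-adaptive integrator must preserve."""
    p_new = p - h * grad_V(q)       # kick: momentum update from potential
    q_new = q + h * grad_T(p_new)   # drift: position update from kinetic
    return q_new, p_new

# Harmonic oscillator H = p^2/2 + q^2/2 on a non-equidistant time grid.
q, p = 1.0, 0.0
for h in np.random.uniform(0.01, 0.1, size=1000):
    q, p = symplectic_euler(q, p, h, grad_V=lambda q: q, grad_T=lambda p: p)
energy = 0.5 * (p**2 + q**2)   # stays close to 0.5; no secular drift
```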
[416] Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization
Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding
Main category: cs.LG
TL;DR: Current LLM unlearning methods are vulnerable to relearning attacks that can recover supposedly erased knowledge. The root cause is that conventional methods drive parameters to sharp minima, making knowledge recoverable through minimal fine-tuning. StableUN addresses this with a bi-level optimization framework that finds more stable parameter regions.
Details
Motivation: To address the security vulnerability in LLM unlearning where "forgotten" information remains recoverable through relearning attacks, exposing a critical robustness gap between apparent unlearning and actual knowledge removal.
Method: StableUN - a bi-level feedback-guided optimization framework that explicitly seeks stable parameter regions via neighborhood-aware optimization. It integrates forgetting feedback (using adversarial perturbations to probe parameter neighborhoods) with remembering feedback to preserve model utility, aligning objectives through gradient projection.
Result: Experiments on WMDP and MUSE benchmarks demonstrate that StableUN is significantly more robust against both relearning and jailbreaking attacks while maintaining competitive utility performance.
Conclusion: The proposed StableUN framework effectively addresses the security vulnerability in LLM unlearning by finding more stable parameter regions, making erased knowledge truly unrecoverable while preserving model utility.
Abstract: Current LLM unlearning methods face a critical security vulnerability that undermines their fundamental purpose: while they appear to successfully remove sensitive or harmful knowledge, this “forgotten” information remains precariously recoverable through relearning attacks. We identify that the root cause is that conventional methods optimizing the forgetting loss at individual data points will drive model parameters toward sharp minima in the loss landscape. In these unstable regions, even minimal parameter perturbations can drastically alter the model’s behaviors. Consequently, relearning attacks exploit this vulnerability by using just a few fine-tuning samples to navigate the steep gradients surrounding these unstable regions, thereby rapidly recovering knowledge that was supposedly erased. This exposes a critical robustness gap between apparent unlearning and actual knowledge removal. To address this issue, we propose StableUN, a bi-level feedback-guided optimization framework that explicitly seeks more stable parameter regions via neighborhood-aware optimization. It integrates forgetting feedback, which uses adversarial perturbations to probe parameter neighborhoods, with remembering feedback to preserve model utility, aligning the two objectives through gradient projection. Experiments on WMDP and MUSE benchmarks demonstrate that our method is significantly more robust against both relearning and jailbreaking attacks while maintaining competitive utility performance.
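The gradient-projection alignment between the two feedback signals can be sketched in the PCGrad style. This is an assumption for illustration; the paper's exact projection rule may differ.

```python
import torch

def project_conflicting(g_forget, g_remember):
    """Align a forgetting gradient with a remembering (utility) gradient:
    if the two conflict (negative inner product), strip the forgetting
    gradient of its component along the utility direction before combining.
    (PCGrad-style sketch of the alignment idea described above.)"""
    dot = torch.dot(g_forget, g_remember)
    if dot < 0:  # objectives pull in opposing directions
        g_forget = g_forget - (dot / g_remember.norm()**2) * g_remember
    return g_forget + g_remember   # combined update direction

g = project_conflicting(torch.randn(1000), torch.randn(1000))
```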
[417] A HyperGraphMamba-Based Multichannel Adaptive Model for ncRNA Classification
Xin An, Ruijie Li, Qiao Ning, Hui Li, Qian Ma, Shikai Guo
Main category: cs.LG
TL;DR: HGMamba-ncRNA is a HyperGraphMamba-based multichannel adaptive model that integrates sequence, structure, and expression features of non-coding RNAs to enhance classification performance, outperforming state-of-the-art methods.
Details
Motivation: Accurate classification of non-coding RNAs (ncRNAs) is essential for functional annotation and disease diagnosis, but existing methods have limitations in feature extraction depth and multimodal fusion.Method: The model uses: 1) MKC-L (parallel Multi-scale Convolution and LSTM) for sequence features, 2) MSGraphTransformer for secondary structure features, 3) CPKAN (Chebyshev Polynomial-based Kolmogorov-Arnold Network) for expression features, and 4) HyperGraphMamba with virtual nodes for multimodal integration.
Result: Experiments on three public datasets show HGMamba-ncRNA consistently outperforms state-of-the-art methods in accuracy and other metrics, demonstrating robustness, effectiveness, and strong transferability.
Conclusion: HGMamba-ncRNA offers a novel and reliable strategy for complex ncRNA functional classification, providing enhanced multimodal feature integration and superior performance compared to existing approaches.
Abstract: Non-coding RNAs (ncRNAs) play pivotal roles in gene expression regulation and the pathogenesis of various diseases. Accurate classification of ncRNAs is essential for functional annotation and disease diagnosis. To address existing limitations in feature extraction depth and multimodal fusion, we propose HGMamba-ncRNA, a HyperGraphMamba-based multichannel adaptive model, which integrates sequence, secondary structure, and optionally available expression features of ncRNAs to enhance classification performance. Specifically, the sequence of ncRNA is modeled using a parallel Multi-scale Convolution and LSTM architecture (MKC-L) to capture both local patterns and long-range dependencies of nucleotides. The structure modality employs a multi-scale graph transformer (MSGraphTransformer) to represent the multi-level topological characteristics of ncRNA secondary structures. The expression modality utilizes a Chebyshev Polynomial-based Kolmogorov-Arnold Network (CPKAN) to effectively model and interpret high-dimensional expression profiles. Finally, by incorporating virtual nodes to facilitate efficient and comprehensive multimodal interaction, HyperGraphMamba is proposed to adaptively align and integrate multichannel heterogeneous modality features. Experiments conducted on three public datasets demonstrate that HGMamba-ncRNA consistently outperforms state-of-the-art methods in terms of accuracy and other metrics. Extensive empirical studies further confirm the model’s robustness, effectiveness, and strong transferability, offering a novel and reliable strategy for complex ncRNA functional classification. Code and datasets are available at https://anonymous.4open.science/r/HGMamba-ncRNA-94D0.
[418] Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, Juan M. Lavista Ferres
Main category: cs.LG
TL;DR: This paper introduces a bottom-up methodology to estimate per-query energy consumption for large-scale LLM systems, finding that current public estimates overstate energy use by 4-20x and identifying significant efficiency gains possible through model, serving platform, and hardware optimizations.
Details
Motivation: As AI inference scales to billions of queries with increasing token demand from reasoning and agentic workflows, reliable energy use estimates are crucial for capacity planning, emissions accounting, and efficiency prioritization. Current public estimates are inconsistent and overstate actual energy consumption.
Method: The authors developed a bottom-up methodology based on token throughput to estimate per-query energy for large-scale LLM systems, considering GPU utilization and PUE constraints for models running on H100 nodes under realistic workloads.
Result: For frontier-scale models (>200B parameters), median energy per query is 0.34 Wh (IQR: 0.18-0.67). With 15x more tokens per query, energy rises 13x to 4.32 Wh. Efficiency interventions can deliver 8-20x reductions in energy per query through model, serving platform, and hardware improvements.
Conclusion: Targeted efficiency interventions can significantly reduce LLM energy consumption, with potential to match web search energy footprints at scale. This mirrors historical data center efficiency gains during internet and cloud expansion, suggesting AI energy growth can be tempered through similar optimization strategies.
Abstract: As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.
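The bottom-up logic lends itself to a back-of-envelope check. The sketch below recomputes per-query energy from token throughput with illustrative numbers (node power, PUE, throughput, and tokens per query are our assumptions, not figures from the paper), landing in the same order of magnitude as the reported 0.34 Wh median:

```python
# Back-of-envelope per-query energy from token throughput (illustrative numbers).
node_power_w = 10_200.0    # assumed H100 node power draw at load, watts
pue = 1.2                  # assumed data-center power usage effectiveness
throughput_tps = 2_500.0   # assumed sustained output tokens/second per node
tokens_per_query = 300.0   # assumed median query length in output tokens

joules_per_token = node_power_w * pue / throughput_tps
wh_per_query = joules_per_token * tokens_per_query / 3600.0
print(f"{wh_per_query:.2f} Wh/query")
# ~0.41 Wh with these assumptions; same order as the paper's 0.34 Wh median
```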
[419] Failure Modes of Maximum Entropy RLHF
Ömer Veysel Çağatan, Barış Akgün
Main category: cs.LG
TL;DR: SimPO can be derived from Maximum Entropy RL with length-normalized temperature, but unlike SimPO in the offline setting, Maximum Entropy RL suffers from overoptimization and unstable KL dynamics in online RLHF.
Details
Motivation: To investigate whether Maximum Entropy RL can achieve similarly strong performance to SimPO in online RLHF settings, given SimPO's success in offline preference optimization.
Method: Theoretical derivation of SimPO from Maximum Entropy RL with length-normalized temperature, followed by experimental investigation of Maximum Entropy RL in online RLHF settings.
Result: Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics in online settings, even at low learning rates, unlike stable KL-constrained methods. Entropy regularization fails to prevent reward hacking and correlates with overoptimization.
Conclusion: Reference-free approaches face distinct challenges in online vs offline preference learning, with SimPO succeeding offline while Maximum Entropy RL struggles online, suggesting fundamental differences between these settings.
Abstract: In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning with length-normalized temperature, providing a theoretical foundation for this reference-free method. Motivated by SimPO’s strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online or offline preference learning.
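For reference, the SimPO objective discussed here scores sequences by length-normalized log-probability with a target margin; the sketch below implements that published loss (the toy inputs are ours):

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: a Bradley-Terry loss on length-normalized sequence log-probs,
    i.e. implicit reward beta * logp / length with target margin gamma,
    and no reference model."""
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

# Toy tensors: summed token log-probs and sequence lengths
loss = simpo_loss(torch.tensor([-120.0]), torch.tensor([40.0]),
                  torch.tensor([-200.0]), torch.tensor([55.0]))
print(loss)
```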
[420] Dynamic Lagging for Time-Series Forecasting in E-Commerce Finance: Mitigating Information Loss with A Hybrid ML Architecture
Abhishek Sharma, Anat Parush, Sumit Wadhwa, Amihai Savir, Anne Guinard, Prateek Srivastava
Main category: cs.LG
TL;DR: A hybrid forecasting framework combining dynamic lagged feature engineering, adaptive rolling-window representations, classical statistical models, and ensemble learners for e-commerce finance forecasting under sparse and irregular data conditions.
Details
Motivation: E-commerce finance forecasting faces challenges from irregular invoice schedules, payment deferrals, user-specific behavioral variability, sparse datasets, and short historical windows, which limit conventional time-series methods and cause deep learning models to deteriorate under partial observability.
Method: Integrates dynamic lagged feature engineering, adaptive rolling-window representations, invoice-level behavioral modeling, structured lag of support data, and custom stability-aware loss functions with classical statistical models and ensemble learners.
Result: Achieves approximately 5% reduction in MAPE compared to baseline models, enhances forecast stability over quarterly horizons, and strengthens feature target correlation by capturing both short- and long-term patterns.
Conclusion: Combining structured lagging, invoice-level closure modeling, and behavioral insights significantly advances predictive accuracy in sparse financial time-series forecasting, translating into substantial financial savings.
Abstract: Accurate forecasting in the e-commerce finance domain is particularly challenging due to irregular invoice schedules, payment deferrals, and user-specific behavioral variability. These factors, combined with sparse datasets and short historical windows, limit the effectiveness of conventional time-series methods. While deep learning and Transformer-based models have shown promise in other domains, their performance deteriorates under partial observability and limited historical data. To address these challenges, we propose a hybrid forecasting framework that integrates dynamic lagged feature engineering and adaptive rolling-window representations with classical statistical models and ensemble learners. Our approach explicitly incorporates invoice-level behavioral modeling, structured lag of support data, and custom stability-aware loss functions, enabling robust forecasts in sparse and irregular financial settings. Empirical results demonstrate an approximate 5% reduction in MAPE compared to baseline models, translating into substantial financial savings. Furthermore, the framework enhances forecast stability over quarterly horizons and strengthens feature target correlation by capturing both short- and long-term patterns, leveraging user profile attributes, and simulating upcoming invoice behaviors. These findings underscore the value of combining structured lagging, invoice-level closure modeling, and behavioral insights to advance predictive accuracy in sparse financial time-series forecasting.
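A minimal sketch of the dynamic-lagging idea, assuming pandas and illustrative column names (not the paper's schema): lagged values, inter-invoice gaps, and rolling statistics computed per user over an irregular series:

```python
import pandas as pd

# Illustrative sparse, irregular invoice series for two users
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-05", "2024-02-17", "2024-03-02",
                            "2024-01-20", "2024-03-15"]),
    "amount": [120.0, 80.0, 95.0, 300.0, 250.0],
}).sort_values(["user_id", "date"])

g = df.groupby("user_id")
df["amount_lag1"] = g["amount"].shift(1)           # previous invoice amount
df["days_since_prev"] = g["date"].diff().dt.days   # irregular spacing as a feature
df["amount_roll_mean"] = g["amount"].transform(
    lambda s: s.rolling(2, min_periods=1).mean())  # short rolling window
print(df)
```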
[421] Predictive Coding-based Deep Neural Network Fine-tuning for Computationally Efficient Domain Adaptation
Matteo Cardoni, Sam Leroux
Main category: cs.LG
TL;DR: A hybrid training method combining Backpropagation and Predictive Coding for efficient on-device domain adaptation in dynamic environments
Details
Motivation: Single static models are insufficient for dynamic real-world environments where input data distributions change due to factors like sensor drift or lighting variations, necessitating continual model adaptation.
Method: Start with offline training using Backpropagation for high initial performance, then use Predictive Coding for online adaptation to recover accuracy lost from input distribution shifts.
Result: Experimental results on MNIST and CIFAR-10 datasets show effective adaptation with reduced computational overhead
Conclusion: The hybrid strategy offers a promising solution for maintaining model performance in dynamic environments, particularly suitable for resource-constrained edge devices or neuromorphic accelerators
Abstract: As deep neural networks are increasingly deployed in dynamic, real-world environments, relying on a single static model is often insufficient. Changes in input data distributions caused by sensor drift or lighting variations necessitate continual model adaptation. In this paper, we propose a hybrid training methodology that enables efficient on-device domain adaptation by combining the strengths of Backpropagation and Predictive Coding. The method begins with a deep neural network trained offline using Backpropagation to achieve high initial performance. Subsequently, Predictive Coding is employed for online adaptation, allowing the model to recover accuracy lost due to shifts in the input data distribution. This approach leverages the robustness of Backpropagation for initial representation learning and the computational efficiency of Predictive Coding for continual learning, making it particularly well-suited for resource-constrained edge devices or future neuromorphic accelerators. Experimental results on the MNIST and CIFAR-10 datasets demonstrate that this hybrid strategy enables effective adaptation with a reduced computational overhead, offering a promising solution for maintaining model performance in dynamic environments.
[422] Extended Low-Rank Approximation Accelerates Learning of Elastic Response in Heterogeneous Materials
Prabhat Karmakar, Sayan Gupta, Ilaksh Adlakha
Main category: cs.LG
TL;DR: xLRA is a compact tensor decomposition framework that efficiently predicts local elastic response from microstructures using minimal data (training on only 5% of the dataset) and low computational cost (six orders of magnitude fewer floating-point operations than contemporary methods).
Details
Motivation: Predicting mechanical response from microstructure is challenging due to high-dimensional features and computational costs of physics-based simulations. Current data-driven approaches require large datasets, motivating development of more efficient methods.
Method: Extended Low Rank Approximation (xLRA) uses canonical polyadic tensor decomposition to map high-dimensional microstructural information to local elastic response by adaptively incorporating higher rank terms with maximum rank of 4.
Result: xLRA accurately predicts local elastic strain fields in porous microstructures, achieves accurate predictions with only 5% training data, demonstrates transferability across material systems, and outperforms contemporary methods in accuracy, generalizability, and computational efficiency.
Conclusion: xLRA provides an efficient framework for predicting elastic response from microstructures, enabling scalable mapping of structure-property linkages with significant data and computational efficiency advantages.
Abstract: Predicting how the microstructure governs the mechanical response of heterogeneous materials is essential for optimizing design and performance. Yet this task remains difficult due to the complex, high dimensional nature of microstructural features. Relying on physics based simulations to probe the microstructural space is computationally prohibitive. This motivates the development of computational tools to efficiently learn structure property linkages governing mechanical behavior. While contemporary data driven approaches offer new possibilities, they often require large datasets. To address this challenge, this work presents the Extended Low Rank Approximation (xLRA), a framework that employs canonical polyadic tensor decomposition. It efficiently maps high dimensional microstructural information to the local elastic response by adaptively incorporating higher rank terms. xLRA accurately predicts the local elastic strain fields in porous microstructures, requiring a maximum rank of only 4. The compact formulation of xLRA achieves accurate predictions when trained on just 5% of the dataset, demonstrating significant data efficiency. Moreover, xLRA proves transferability by delivering results across representative material systems, including two phase composites and single and dual phase polycrystals. Despite being compact, xLRA retains essential microstructural details, enabling accurate predictions on unseen microstructures. Benchmarking shows that xLRA outperforms contemporary methods in predictive accuracy, generalizability, and computational efficiency, while requiring 6 orders of magnitude fewer floating point operations. In summary, xLRA provides an efficient framework for predicting the elastic response from microstructures, enabling scalable mapping of structure property linkages.
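The core primitive here is a low-rank canonical polyadic (CP) decomposition. A minimal sketch using the tensorly library on a random 3-way tensor (the paper's actual microstructure-to-strain mapping is not reproduced):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Rank-4 CP decomposition of a synthetic 3-way tensor, mirroring xLRA's
# maximum rank of 4; the data here is random and purely illustrative.
X = tl.tensor(np.random.rand(20, 20, 20))
cp = parafac(X, rank=4, n_iter_max=200, tol=1e-8)
X_hat = tl.cp_to_tensor(cp)                 # reconstruct from factor matrices
rel_err = tl.norm(X - X_hat) / tl.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```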
[423] PGCLODA: Prompt-Guided Graph Contrastive Learning for Oligopeptide-Infectious Disease Association Prediction
Dayu Tan, Jing Chen, Xiaoping Zhou, Yansen Su, Chunhou Zheng
Main category: cs.LG
TL;DR: PGCLODA is a prompt-guided graph-based contrastive learning framework that predicts associations between oligopeptides and infectious diseases using a tripartite graph with structural and semantic information, achieving state-of-the-art performance.
Details
Motivation: Infectious diseases threaten public health, but computational models for predicting oligopeptide-disease associations are scarce despite oligopeptides' advantages as antimicrobial candidates.
Method: Constructs tripartite graph with oligopeptides, microbes, and diseases; uses prompt-guided graph augmentation for contrastive learning; employs dual encoder (GCN + Transformer) for local/global features; MLP classifier for final prediction.
Result: PGCLODA outperforms state-of-the-art models in AUROC, AUPRC, and accuracy on benchmark dataset; ablation studies confirm module contributions; case studies validate generalization and biological relevance.
Conclusion: The framework provides valuable insights for mechanism-driven discovery and oligopeptide-based drug development, with source code publicly available.
Abstract: Infectious diseases continue to pose a serious threat to public health, underscoring the urgent need for effective computational approaches to screen novel anti-infective agents. Oligopeptides have emerged as promising candidates in antimicrobial research due to their structural simplicity, high bioavailability, and low susceptibility to resistance. Despite their potential, computational models specifically designed to predict associations between oligopeptides and infectious diseases remain scarce. This study introduces a prompt-guided graph-based contrastive learning framework (PGCLODA) to uncover potential associations. A tripartite graph is constructed with oligopeptides, microbes, and diseases as nodes, incorporating both structural and semantic information. To preserve critical regions during contrastive learning, a prompt-guided graph augmentation strategy is employed to generate meaningful paired views. A dual encoder architecture, integrating Graph Convolutional Network (GCN) and Transformer, is used to jointly capture local and global features. The fused embeddings are subsequently input into a multilayer perceptron (MLP) classifier for final prediction. Experimental results on a benchmark dataset indicate that PGCLODA consistently outperforms state-of-the-art models in AUROC, AUPRC, and accuracy. Ablation and hyperparameter studies confirm the contribution of each module. Case studies further validate the generalization ability of PGCLODA and its potential to uncover novel, biologically relevant associations. These findings offer valuable insights for mechanism-driven discovery and oligopeptide-based drug development. The source code of PGCLODA is available online at https://github.com/jjnlcode/PGCLODA.
[424] When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity
Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson
Main category: cs.LG
TL;DR: LLM-judged benchmarks have design flaws that introduce noise into rankings. The paper introduces two diagnostic tools (schematic adherence and psychometric validity) to quantify unexplained variance and ranking uncertainty, revealing severe issues in popular benchmarks like Arena-Hard Auto.
Details
Motivation: To address the failure modes in LLM-judged benchmarks that produce high-confidence rankings that are largely noise due to lack of tight objectives and verifiable constructions.
Method: Introduces schematic adherence (quantifies how much verdict is explained by evaluation schema) and psychometric validity (aggregates internal consistency and discriminant validity signals) to diagnose benchmark issues.
Result: Found severe schema incoherence (90%+ unexplained variance for DeepSeek-R1-32B) and factor collapse (correlations >0.93) in Arena-Hard Auto. ELO-style aggregation masks genuine ranking uncertainty.
Conclusion: Highlights design failures undermining validity and offers principles for building better-scoped, reliability-aware LLM-judged benchmarks.
Abstract: LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth based benchmarks. We argue that without tight objectives and verifiable constructions, benchmark rankings can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge’s overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We release our code at https://anonymous.4open.science/r/judgment-to-noise-947D/README.md
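One way to read "schematic adherence" is as the variance in the overall verdict explained by the judge's own rubric scores. The sketch below computes that proxy with a linear fit on synthetic scores; the exact estimator in the paper may differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic judge data: per-criterion rubric scores plus an overall verdict
# that only partly follows the rubric (extra noise stands in for deviation).
rng = np.random.default_rng(0)
rubric = rng.uniform(1, 10, size=(500, 4))                 # 4 rubric criteria
overall = rubric.mean(axis=1) + rng.normal(0, 2.0, 500)    # noisy verdict

# R^2 of verdict on rubric = explained variance; the remainder is unexplained.
r2 = LinearRegression().fit(rubric, overall).score(rubric, overall)
print(f"unexplained variance: {1 - r2:.1%}")
```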
[425] Video models are zero-shot learners and reasoners
Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos
Main category: cs.LG
TL;DR: Veo 3 demonstrates emergent zero-shot capabilities in visual tasks, suggesting video models are evolving into general-purpose vision foundation models similar to LLMs’ trajectory.
Details
Motivation: To explore whether video models can follow the same trajectory as LLMs towards general-purpose understanding, moving from task-specific models to unified foundation models.
Method: Using Veo 3, a generative video model trained on web-scale data, to test zero-shot performance on various visual tasks without explicit training.
Result: Veo 3 successfully solves diverse tasks including object segmentation, edge detection, image editing, physical property understanding, affordance recognition, tool use simulation, and visual reasoning tasks like maze solving.
Conclusion: The emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models, following the same trajectory as LLMs in language understanding.
Abstract: The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today’s generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn’t explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo’s emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
[426] Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin
Main category: cs.LG
TL;DR: The paper introduces Effective Span Dimension (ESD), a new complexity measure for spectral algorithms that depends on signal, spectrum, and noise level, and proves minimax risk bounds scaling with σ²K for sequence models.
Details
Motivation: To develop a unified framework for analyzing spectral algorithms when kernels are learned from data, overcoming limitations of traditional fixed-kernel theories that require restrictive eigen-decay or source conditions.
Method: Introduces ESD as an alignment-sensitive complexity measure, proves minimax risk bounds for sequence models, analyzes over-parameterized gradient flow’s ability to reduce ESD, and extends the framework to linear models and RKHS regression with numerical validation.
Result: Establishes that for sequence models with ESD ≤ K, minimax excess risk scales as σ²K, and shows gradient flow can reduce ESD, connecting adaptive feature learning to improved generalization in spectral algorithms.
Conclusion: The ESD framework provides a novel perspective on generalization that goes beyond traditional fixed-kernel theories, enabling analysis of kernel learning scenarios without restrictive assumptions.
Abstract: We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.
[427] Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang
Main category: cs.LG
TL;DR: Transformer-based LLMs show strong graph reasoning performance but their internal mechanisms are not well understood. This study uses circuit-tracer framework to analyze decoder-only transformers and identifies two core mechanisms: token merging and structural memorization.
Details
Motivation: To uncover the internal reasoning process mechanisms of transformer-based LLMs in graph reasoning tasks, as these mechanisms remain underexplored despite strong performance.
Method: Using basic decoder-only transformers and explaining them through the circuit-tracer framework to visualize reasoning traces and identify core mechanisms in graph reasoning.
Result: Identified two core mechanisms: token merging and structural memorization, which underlie both path reasoning and substructure extraction tasks. Quantified these behaviors and analyzed their relationship with graph density and model size.
Conclusion: Provides a unified interpretability framework for understanding structural reasoning in decoder-only Transformers, offering insights into how these models perform graph reasoning tasks.
Abstract: Transformer-based LLMs demonstrate strong performance on graph reasoning tasks, yet their internal mechanisms remain underexplored. To uncover these reasoning process mechanisms in a fundamental and unified view, we set the basic decoder-only transformers and explain them using the circuit-tracer framework. Through this lens, we visualize reasoning traces and identify two core mechanisms in graph reasoning: token merging and structural memorization, which underlie both path reasoning and substructure extraction tasks. We further quantify these behaviors and analyze how they are influenced by graph density and model size. Our study provides a unified interpretability framework for understanding structural reasoning in decoder-only Transformers.
[428] Graph Variate Neural Networks
Om Roy, Yashar Moshfeghi, Keith Smith
Main category: cs.LG
TL;DR: Graph-Variate Neural Networks (GVNNs) are introduced as a new GNN layer that convolves spatio-temporal signals with signal-dependent connectivity tensors, combining stable long-term support with instantaneous data-driven interactions for dynamic spatio-temporal modeling.
Details
Motivation: Traditional GNNs assume existing underlying graph structures, but many real-world spatio-temporal signals lack predefined graphs or have graphs derived independently from the signal. There's a need to model dynamically evolving functional networks directly from multi-channel data.
Method: GVNNs build on Graph Variate Signal Analysis (GVSA) framework, using network tensors of instantaneous connectivity profiles against a stable support constructed from the signal itself. They convolve spatio-temporal signals with signal-dependent connectivity tensors that capture dynamic statistical interdependencies at each time step without sliding windows.
Result: GVNNs achieve linear complexity in sequence length and consistently outperform strong graph-based baselines across forecasting benchmarks. They are competitive with LSTMs and Transformers. On EEG motor-imagery classification, GVNNs achieve strong accuracy for brain-computer interface applications.
Conclusion: GVNNs provide an effective framework for modeling dynamically evolving spatio-temporal signals by combining stable graph support with data-driven instantaneous interactions, demonstrating superior performance over traditional graph-based methods and competitiveness with sequence models.
Abstract: Modelling dynamically evolving spatio-temporal signals is a prominent challenge in the Graph Neural Network (GNN) literature. Notably, GNNs assume an existing underlying graph structure. While this underlying structure may not always exist or is derived independently from the signal, a temporally evolving functional network can always be constructed from multi-channel data. Graph Variate Signal Analysis (GVSA) defines a unified framework consisting of a network tensor of instantaneous connectivity profiles against a stable support usually constructed from the signal itself. Building on GVSA and tools from graph signal processing, we introduce Graph-Variate Neural Networks (GVNNs): layers that convolve spatio-temporal signals with a signal-dependent connectivity tensor combining a stable long-term support with instantaneous, data-driven interactions. This design captures dynamic statistical interdependencies at each time step without ad hoc sliding windows and admits an efficient implementation with linear complexity in sequence length. Across forecasting benchmarks, GVNNs consistently outperform strong graph-based baselines and are competitive with widely used sequence models such as LSTMs and Transformers. On EEG motor-imagery classification, GVNNs achieve strong accuracy highlighting their potential for brain-computer interface applications.
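A minimal sketch of the graph-variate convolution as we read it: a stable support modulated by an instantaneous, signal-dependent connectivity profile at each time step. The choice of outer-product connectivity and correlation support is our assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 8, 100
X = rng.standard_normal((T, N))       # multichannel spatio-temporal signal
A = np.abs(np.corrcoef(X.T))          # stable long-term support from the signal

out = np.empty_like(X)
for t in range(T):
    inst = np.outer(X[t], X[t])       # instantaneous connectivity profile
    C_t = A * inst                    # one slice of the connectivity tensor
    out[t] = C_t @ X[t]               # propagation step, no sliding window
print(out.shape)                      # (100, 8): filtered signal per time step
```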
[429] A Recovery Guarantee for Sparse Neural Networks
Sara Fridovich-Keil, Mert Pilanci
Main category: cs.LG
TL;DR: First guarantees for sparse recovery of ReLU neural network weights using iterative hard thresholding with linear memory growth in nonzero weights
Details
Motivation: To provide theoretical guarantees for recovering sparse neural network weights efficiently, addressing memory limitations in existing methods.
Method: Iterative hard thresholding algorithm that recovers sparse network weights with memory growing linearly with the number of nonzero weights.
Result: Exact recovery of sparse weights validated through experiments on sparse planted MLPs, MNIST classification, and implicit neural representations
Conclusion: The method achieves competitive or superior performance compared to memory-inefficient iterative magnitude pruning baseline
Abstract: We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.
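Iterative hard thresholding itself is standard; the sketch below shows it on a linear sparse-recovery problem rather than the paper's two-layer ReLU setting:

```python
import numpy as np

def iht(A, y, k, n_iters=200, step=None):
    """Iterative hard thresholding for y ~ A @ w with k-sparse w:
    gradient step, then keep only the k largest-magnitude entries."""
    m, n = A.shape
    step = step or 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L step size
    w = np.zeros(n)
    for _ in range(n_iters):
        w = w + step * A.T @ (y - A @ w)             # gradient step
        idx = np.argpartition(np.abs(w), -k)[-k:]    # top-k support
        mask = np.zeros(n, dtype=bool)
        mask[idx] = True
        w[~mask] = 0.0                               # hard threshold
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 256)) / 10.0
w_true = np.zeros(256)
w_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
print(np.linalg.norm(iht(A, A @ w_true, k=5) - w_true))  # near-zero error
```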
[430] Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization
Tianyu Ruan, Kuo Gai, Shihua Zhang
Main category: cs.LG
TL;DR: The paper investigates why deep networks generalize well by examining temporal consistency in feature evolution, showing that predictions remain stable when combining shallow features from earlier checkpoints with deeper features from later ones.
Details
Motivation: To understand the fundamental question of why deep networks generalize well, moving beyond classical generalization theory by examining internal feature evolution rather than just inputs and outputs.
Method: The study analyzes temporal consistency in deep networks by examining how predictions remain stable when combining features from different training checkpoints, and uses statistical tests to analyze SGD noise patterns.
Result: The research reveals temporal consistency acts as implicit structured augmentation that supports generalization, extends to unseen/corrupted data, collapses with destroyed semantic structure, and shows SGD injects anisotropic noise aligned with principal directions.
Conclusion: The findings provide a conceptual perspective linking feature dynamics to generalization, suggesting future work on practical surrogates for measuring temporal feature evolution.
Abstract: Why do deep networks generalize well? In contrast to classical generalization theory, we approach this fundamental question by examining not only inputs and outputs, but the evolution of internal features. Our study suggests a phenomenon of temporal consistency that predictions remain stable when shallow features from earlier checkpoints combine with deeper features from later ones. This stability is not a trivial convergence artifact. It acts as a form of implicit, structured augmentation that supports generalization. We show that temporal consistency extends to unseen and corrupted data, but collapses when semantic structure is destroyed (e.g., random labels). Statistical tests further reveal that SGD injects anisotropic noise aligned with a few principal directions, reinforcing its role as a source of structured variability. Together, these findings suggest a conceptual perspective that links feature dynamics to generalization, pointing toward future work on practical surrogates for measuring temporal feature evolution.
[431] Spatio-Temporal Directed Graph Learning for Account Takeover Fraud Detection
Mohsen Nayebi Kerdabadi, William Andrew Byron, Xin Sun, Amirfarrokh Iranitalab
Main category: cs.LG
TL;DR: ATLAS is a framework that reformulates Account Takeover (ATO) fraud detection as spatio-temporal node classification on a time-respecting directed session graph, achieving significant improvements in fraud detection while reducing customer friction.
Details
Motivation: Traditional ATO detection systems using tabular gradient-boosted decision trees score sessions independently, overlooking the relational and temporal structure of coordinated attacks and fraud rings that characterize sophisticated fraud patterns.
Method: ATLAS links entities via shared identifiers (account, device, IP) with time-window and recency constraints, enabling causal, time-respecting message passing and latency-aware label propagation. It uses inductive GraphSAGE variants trained via neighbor sampling on large-scale session graphs.
Result: On a high-risk digital product at Capital One, ATLAS delivers 6.38% AUC improvement and more than 50% reduction in customer friction, improving fraud capture while reducing user friction.
Conclusion: The ATLAS framework successfully addresses limitations of traditional ATO detection by incorporating spatio-temporal graph structures, demonstrating superior performance in real-world banking applications.
Abstract: Account Takeover (ATO) fraud poses a significant challenge in consumer banking, requiring high recall under strict latency while minimizing friction for legitimate users. Production systems typically rely on tabular gradient-boosted decision trees (e.g., XGBoost) that score sessions independently, overlooking the relational and temporal structure of online activity that characterizes coordinated attacks and “fraud rings.” We introduce ATLAS (Account Takeover Learning Across Spatio-Temporal Directed Graph), a framework that reformulates ATO detection as spatio-temporal node classification on a time-respecting directed session graph. ATLAS links entities via shared identifiers (account, device, IP) and regulates connectivity with time-window and recency constraints, enabling causal, time-respecting message passing and latency-aware label propagation that uses only labels available at scoring time, non-anticipative and leakage-free. We operationalize ATLAS with inductive GraphSAGE variants trained via neighbor sampling, at scale on a sessions graph with more than 100M nodes and around 1B edges. On a high-risk digital product at Capital One, ATLAS delivers 6.38 percent AUC improvement and more than 50 percent reduction in customer friction, improving fraud capture while reducing user friction.
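A minimal sketch of time-respecting edge construction, assuming pandas and illustrative column names and window length: sessions sharing a device are linked only from earlier to later sessions within the window:

```python
import pandas as pd

sessions = pd.DataFrame({
    "session_id": [1, 2, 3, 4],
    "device_id": ["d1", "d1", "d2", "d1"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-04", "2024-02-01"]),
})

# Self-join on a shared identifier, keeping only directed, time-respecting
# pairs (earlier -> later) inside an assumed 7-day window.
pairs = sessions.merge(sessions, on="device_id", suffixes=("_src", "_dst"))
mask = (pairs["ts_src"] < pairs["ts_dst"]) & \
       (pairs["ts_dst"] - pairs["ts_src"] <= pd.Timedelta(days=7))
edges = pairs.loc[mask, ["session_id_src", "session_id_dst"]]
print(edges)  # only session 1 -> session 2 falls within the 7-day window
```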
[432] Process-Informed Forecasting of Complex Thermal Dynamics in Pharmaceutical Manufacturing
Ramona Rubini, Siavash Khodakarami, Aniruddha Bora, George Em Karniadakis, Michele Dassisti
Main category: cs.LG
TL;DR: PIF models integrate process knowledge into forecasting for pharmaceutical lyophilization temperature, outperforming pure data-driven models in accuracy, physical consistency, and noise robustness.
Details
Motivation: Deep learning models for industrial forecasting lack physical consistency and robustness, limiting their reliability in regulated environments like pharmaceutical manufacturing.
Method: Compared classical models (ARIMA, ETS) and deep learning architectures (KANs) with three process-informed loss functions: fixed-weight loss, dynamic uncertainty-based loss, and Residual-Based Attention mechanism.
Result: PIF models demonstrated superior performance over data-driven counterparts in accuracy, physical plausibility, and noise resilience, with successful transfer learning to new processes.
Conclusion: This work provides a roadmap for developing reliable and generalizable forecasting solutions for critical pharmaceutical manufacturing applications.
Abstract: Accurate time-series forecasting for complex physical systems is the backbone of modern industrial monitoring and control. While deep learning models excel at capturing complex dynamics, currently, their deployment is limited due to physical inconsistency and robustness, hence constraining their reliability in regulated environments. We introduce process-informed forecasting (PIF) models for temperature in pharmaceutical lyophilization. We investigate a wide range of models, from classical ones such as Autoregressive Integrated Moving Average Model (ARIMA) and Exponential Smoothing Model (ETS), to modern deep learning architectures, including Kolmogorov-Arnold Networks (KANs). We compare three different loss function formulations that integrate a process-informed trajectory prior: a fixed-weight loss, a dynamic uncertainty-based loss, and a Residual-Based Attention (RBA) mechanism. We evaluate all models not only for accuracy and physical consistency but also for robustness to sensor noise. Furthermore, we test the practical generalizability of the best model in a transfer learning scenario on a new process. Our results show that PIF models outperform their data-driven counterparts in terms of accuracy, physical plausibility and noise resilience. This work provides a roadmap for developing reliable and generalizable forecasting solutions for critical applications in the pharmaceutical manufacturing landscape.
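The fixed-weight variant of a process-informed loss is the simplest of the three formulations compared: data fit plus a weighted pull toward a process-based trajectory prior. A minimal sketch (the weight and toy values are assumptions):

```python
import torch

def pif_loss(pred, target, prior, lam=0.1):
    """Fixed-weight process-informed loss: MSE data fit plus a penalty pulling
    the forecast toward a process-based trajectory prior. The dynamic-
    uncertainty and RBA variants reweight the second term; lam is assumed."""
    data_term = torch.mean((pred - target) ** 2)
    process_term = torch.mean((pred - prior) ** 2)
    return data_term + lam * process_term

pred = torch.tensor([20.1, 21.0, 22.2])
target = torch.tensor([20.0, 21.2, 22.0])
prior = torch.tensor([20.0, 21.0, 22.0])   # e.g., a first-principles temperature curve
print(pif_loss(pred, target, prior))
```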
[433] Markov Decision Processes under External Temporal Processes
Ranga Shaarad Ayyagari, Revanth Raj Eega, Ambedkar Dukkipati
Main category: cs.LG
TL;DR: This paper proposes a reinforcement learning framework for nonstationary environments influenced by external temporal processes, establishing conditions for tractability and analyzing policy iteration algorithms with finite history considerations.
Details
Motivation: Real-world environments are nonstationary and continuously evolve due to external events, while humans make decisions by discerning patterns in historical events. Existing RL algorithms are predominantly designed for stationary environments.
Method: The study investigates Markov Decision Processes under external temporal processes, proposes a policy iteration algorithm that learns policies contingent on current state and finite history of prior events, and analyzes least-squares policy evaluation with finite history approximations.
Result: The proposed algorithm is not guaranteed to converge but provides policy improvement guarantees in certain state space regions. The paper establishes sample complexity bounds and analyzes discrete-time Hawkes processes with Gaussian marks as a specific case.
Conclusion: The framework enables RL in nonstationary environments by considering finite history of external events, with theoretical guarantees for policy improvement and sample complexity analysis, validated through experiments in traditional control environments.
Abstract: Reinforcement Learning Algorithms are predominantly developed for stationary environments, and the limited literature that considers nonstationary environments often involves specific assumptions about changes that can occur in transition probability matrices and reward functions. Considering that real-world applications involve environments that continuously evolve due to various external events, and humans make decisions by discerning patterns in historical events, this study investigates Markov Decision Processes under the influence of an external temporal process. We establish the conditions under which the problem becomes tractable, allowing it to be addressed by considering only a finite history of events, based on the properties of the perturbations introduced by the exogenous process. We propose and theoretically analyze a policy iteration algorithm to tackle this problem, which learns policies contingent upon the current state of the environment, as well as a finite history of prior events of the exogenous process. We show that such an algorithm is not guaranteed to converge. However, we provide a guarantee for policy improvement in regions of the state space determined by the approximation error induced by considering tractable policies and value functions. We also establish the sample complexity of least-squares policy evaluation and policy improvement algorithms that consider approximations due to the incorporation of only a finite history of temporal events. While our results are applicable to general discrete-time processes satisfying certain conditions on the rate of decay of the influence of their events, we further analyze the case of discrete-time Hawkes processes with Gaussian marks. We performed experiments to demonstrate our findings for policy evaluation and deployment in traditional control environments.
[434] Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity
Charlie Hou, Kiran Koshy Thekumparampil, Michael Shavlovsky, Giulia Fanti, Yesh Dattatreya, Sujay Sanghavi
Main category: cs.LG
TL;DR: DL models outperform GBDTs in tabular Learning-to-Rank when labeled data is scarce, by leveraging unsupervised pretraining on abundant unlabeled data.
Details
Motivation: Previous studies showed DL models underperform GBDTs on tabular data, but these settings didn't capture real-world complexities like label scarcity in ranking applications where unlabeled data is abundant.
Method: Use unsupervised pretraining of DL rankers on abundant unlabeled data, then fine-tune on scarce labeled data for tabular Learning-to-Rank tasks.
Result: Pretrained DL rankers consistently outperform GBDT rankers on ranking metrics by up to 38%, both overall and on outlier data, across public and proprietary datasets.
Conclusion: DL models can indeed outperform GBDTs in practical tabular data scenarios, specifically in Learning-to-Rank with label scarcity, when leveraging unsupervised pretraining techniques.
Abstract: On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best similarly to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. However, these works often study idealized problem settings which may fail to capture complexities of real-world scenarios. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data, and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics – sometimes by as much as 38% – both overall and on outliers.
[435] DeNOTS: Stable Deep Neural ODEs for Time Series
Ilya Kuleshov, Evgenia Romanenkova, Vladislav Zhuzhel, Galina Boeva, Evgeni Vorsin, Alexey Zaytsev
Main category: cs.LG
TL;DR: DeNOTS proposes scaling integration time horizon to increase function evaluations (depth) in Neural CDEs, with Negative Feedback stabilization to prevent uncontrolled growth, achieving improved performance over existing methods.
Details
Motivation: Current Neural CDEs regulate function evaluations via solver error tolerance, but lowering tolerances doesn't adequately increase model expressiveness. There's a need for a better way to 'deepen' these models effectively.
Method: Scale integration time horizon to increase NFEs (natural analog of depth), and use Negative Feedback stabilization to control dynamics growth. This ensures provable stability without constraining flexibility.
Result: DeNOTS outperforms existing approaches including Neural RDEs and state space models, achieving up to 20% improvement in metrics across four open datasets.
Conclusion: DeNOTS combines expressiveness, stability, and robustness, enabling reliable modeling in continuous-time domains with theoretical bounds for Neural ODE risk using Gaussian process theory.
Abstract: Neural CDEs provide a natural way to process the temporal evolution of irregular time series. The number of function evaluations (NFE) is these systems’ natural analog of depth (the number of layers in traditional neural networks). It is usually regulated via solver error tolerance: lower tolerance means higher numerical precision, requiring more integration steps. However, lowering tolerances does not adequately increase the models’ expressiveness. We propose a simple yet effective alternative: scaling the integration time horizon to increase NFEs and “deepen” the model. Increasing the integration interval causes uncontrollable growth in conventional vector fields, so we also propose a way to stabilize the dynamics via Negative Feedback (NF). It ensures provable stability without constraining flexibility. It also implies robustness: we provide theoretical bounds for Neural ODE risk using Gaussian process theory. Experiments on four open datasets demonstrate that our method, DeNOTS, outperforms existing approaches, including recent Neural RDEs and state space models, achieving up to 20% improvement in metrics. DeNOTS combines expressiveness, stability, and robustness, enabling reliable modelling in continuous-time domains.
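A minimal sketch of a negative-feedback vector field and a scaled-horizon rollout, under our own parameterization assumptions (the paper's exact architecture may differ):

```python
import torch
import torch.nn as nn

class NegativeFeedbackField(nn.Module):
    """Vector field g(h, x) - alpha * h: the linear damping term keeps the
    hidden state bounded as the integration horizon is scaled up."""
    def __init__(self, hidden_dim, input_dim, alpha=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + input_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.alpha = alpha

    def forward(self, h, x_t):
        return self.net(torch.cat([h, x_t], dim=-1)) - self.alpha * h

# Euler rollout over a scaled horizon T=4: more steps mean more NFEs,
# the continuous-time analog of a deeper network.
field, h = NegativeFeedbackField(16, 3), torch.zeros(1, 16)
x = torch.randn(50, 1, 3)
dt = 4.0 / 50
for t in range(50):
    h = h + dt * field(h, x[t])
print(h.norm())  # stays bounded thanks to the negative feedback term
```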
[436] On the Integration of Spatial-Temporal Knowledge: A Lightweight Approach to Atmospheric Time Series Forecasting
Yisong Fu, Fei Wang, Zezhi Shao, Boyu Diao, Lin Wu, Zhulin An, Chengqing Yu, Yujie Li, Yongjun Xu
Main category: cs.LG
TL;DR: STELLA is a lightweight atmospheric time series forecasting model that replaces complex Transformer architectures with simple spatial-temporal position embedding and MLP, achieving superior performance with only 10k parameters and 1-hour training.
Details
Motivation: Transformers in atmospheric forecasting have excessive parameters and long training times. The paper discovers that spatial-temporal position embedding alone can effectively model atmospheric correlations without attention mechanisms.
Method: Proposes STELLA model using only spatial-temporal position embedding (integrating geographical coordinates and temporal features) with MLP architecture instead of Transformer layers.
Result: STELLA achieves superior performance on five datasets compared to advanced methods, using only 10k parameters and one hour of training.
Conclusion: Spatial-temporal knowledge integration is more effective than complex architectures for atmospheric forecasting, providing novel insights for the field.
Abstract: Transformers have gained attention in atmospheric time series forecasting (ATSF) for their ability to capture global spatial-temporal correlations. However, their complex architectures lead to excessive parameter counts and extended training times, limiting their scalability to large-scale forecasting. In this paper, we revisit ATSF from a theoretical perspective of atmospheric dynamics and uncover a key insight: spatial-temporal position embedding (STPE) can inherently model spatial-temporal correlations even without attention mechanisms. Its effectiveness arises from the integration of geographical coordinates and temporal features, which are intrinsically linked to atmospheric dynamics. Based on this, we propose STELLA, a Spatial-Temporal knowledge Embedded Lightweight modeL for ATSF, utilizing only STPE and an MLP architecture in place of Transformer layers. With 10k parameters and one hour of training, STELLA achieves superior performance on five datasets compared to other advanced methods. The paper emphasizes the effectiveness of spatial-temporal knowledge integration over complex architectures, providing novel insights for ATSF. The code is available at https://github.com/GestaltCogTeam/STELLA.
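A minimal sketch of the STPE-plus-MLP idea: embed coordinates and time features, concatenate with the input window, and forecast with an MLP alone. Dimensions and embedding choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class STPEForecaster(nn.Module):
    """Spatial-temporal position embedding (coords + time-of-day) feeding a
    plain MLP forecaster; no attention layers anywhere."""
    def __init__(self, window, horizon, emb_dim=16):
        super().__init__()
        self.coord_emb = nn.Linear(2, emb_dim)      # (lat, lon)
        self.hour_emb = nn.Embedding(24, emb_dim)   # time-of-day feature
        self.mlp = nn.Sequential(
            nn.Linear(window + 2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, horizon),
        )

    def forward(self, x_window, latlon, hour):
        z = torch.cat([x_window, self.coord_emb(latlon), self.hour_emb(hour)], dim=-1)
        return self.mlp(z)

model = STPEForecaster(window=24, horizon=6)
y = model(torch.randn(8, 24), torch.randn(8, 2), torch.randint(0, 24, (8,)))
print(y.shape)  # (8, 6): six-step forecast per station
```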
[437] Robust Training of Neural Networks at Arbitrary Precision and Sparsity
Chengxi Ye, Grace Chu, Yanfeng Liu, Yichi Zhang, Lukasz Lew, Li Zhang, Mark Sandler, Andrew Howard
Main category: cs.LG
TL;DR: The paper introduces a denoising dequantization transform to address the gradient mismatch problem in quantization and sparsification, enabling stable training of ultra-low precision and sparse networks.
Details
Motivation: Standard STE has a mismatch between quantization-aware forward pass and quantization-oblivious backward pass, leading to unmanaged error that corrupts learning, especially in ultra-low precision and sparse regimes.
Method: A denoising dequantization transform derived from ridge regression objective makes the learning process aware of quantization error by creating explicit corrective gradient paths. Extends to sparsification as a special form of quantization.
Result: Enables stable training of fully binary (A1W1) and sparse sub-1-bit networks where other methods fail, achieving state-of-the-art results across various precision and sparsity levels.
Conclusion: Provides a theoretically-grounded unified framework for training hyper-efficient neural networks with off-the-shelf recipes at wide spectrum of precisions and sparsity levels.
Abstract: The discontinuous operations inherent in quantization and sparsification introduce a long-standing obstacle to backpropagation, particularly in ultra-low precision and sparse regimes. The standard Straight-Through Estimator (STE) is widely used to address this, but the well-understood mismatch between its quantization-aware forward pass and quantization-oblivious backward pass leads to unmanaged error that can corrupt the learning process. We solve this by introducing a denoising dequantization transform derived from a principled ridge regression objective. This transform makes the entire learning process aware of and robust to the quantization error that STE’s surrogate gradient bypasses, by creating an explicit, corrective gradient path. We extend this principle to sparsification by viewing it as a special form of quantization that maps insignificant values to zero. Our unified framework allows existing models to be trained at a wide spectrum of precisions and sparsity levels with off-the-shelf recipes, achieving stable training of fully binary (A1W1) and sparse sub-1-bit networks where other methods falter. This approach yields state-of-the-art results and provides a theoretically-grounded path to hyper-efficient neural networks.
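Our reading of a ridge-derived "denoising dequantization" is a closed-form dequantization scale with ridge shrinkage; the sketch below shows that interpretation for 1-bit weights (the paper's actual transform may differ):

```python
import torch

def ridge_dequant_scale(w, q, lam=1e-3):
    """Closed-form ridge solution for a per-tensor dequantization scale:
    argmin_s ||w - s*q||^2 + lam*s^2  =  <q, w> / (<q, q> + lam).
    This is our interpretation of the ridge-regression objective, not the
    paper's verified implementation."""
    qf, wf = q.flatten(), w.flatten()
    return torch.dot(qf, wf) / (torch.dot(qf, qf) + lam)

w = torch.randn(64)          # full-precision weights
q = torch.sign(w)            # 1-bit codes (A1W1-style binarization)
s = ridge_dequant_scale(w, q)
w_hat = s * q                # dequantized weights with ridge-shrunk scale
print(s, (w - w_hat).pow(2).mean())
```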
[438] Representation Convergence: Mutual Distillation is Secretly a Form of Regularization
Zhengpeng Xie, Jiahang Cao, Changwei Wang, Fan Yang, Marco Hutter, Qiang Zhang, Jianxiong Zhang, Renjing Xu
Main category: cs.LG
TL;DR: Mutual distillation between RL policies acts as implicit regularization against overfitting to irrelevant features, improving generalization through enhanced robustness.
Details
Motivation: To understand how mutual distillation improves generalization in reinforcement learning by preventing overfitting to irrelevant features and promoting invariant representations.
Method: Theoretical proof that policy robustness to irrelevant features enhances generalization, and empirical demonstration that mutual distillation between policies fosters such robustness and invariant representations.
Result: Mutual distillation enables spontaneous emergence of invariant representations over pixel inputs and improves generalization performance through enhanced robustness.
Conclusion: The paper provides theoretical and empirical evidence that mutual distillation serves as implicit regularization, deepening understanding of generalization mechanisms in RL rather than achieving state-of-the-art performance.
Abstract: In this paper, we argue that mutual distillation between reinforcement learning policies serves as an implicit regularization, preventing them from overfitting to irrelevant features. We highlight two separate contributions: (i) Theoretically, for the first time, we prove that enhancing the policy robustness to irrelevant features leads to improved generalization performance. (ii) Empirically, we demonstrate that mutual distillation between policies contributes to such robustness, enabling the spontaneous emergence of invariant representations over pixel inputs. Ultimately, we do not claim to achieve state-of-the-art performance but rather focus on uncovering the underlying principles of generalization and deepening our understanding of its mechanisms.
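A minimal sketch of the mutual-distillation term: a symmetric KL between two policies' action distributions, added to each policy's own RL objective (not shown); the details here are our assumptions:

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(logits_a, logits_b):
    """Symmetric KL between two policies' action distributions; each policy
    also optimizes its own RL objective separately. The implicit
    regularization effect comes from this cross-policy agreement term."""
    log_pa, log_pb = F.log_softmax(logits_a, -1), F.log_softmax(logits_b, -1)
    pa, pb = log_pa.exp(), log_pb.exp()
    kl_ab = (pa * (log_pa - log_pb)).sum(-1).mean()
    kl_ba = (pb * (log_pb - log_pa)).sum(-1).mean()
    return 0.5 * (kl_ab + kl_ba)

la, lb = torch.randn(32, 6), torch.randn(32, 6)  # two policies, 6 actions
print(mutual_distillation_loss(la, lb))
```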
[439] Compact Rule-Based Classifier Learning via Gradient Descent
Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez
Main category: cs.LG
TL;DR: FRR is a gradient-based fuzzy rule learning system that maintains interpretability while achieving competitive performance through semantically meaningful fuzzy logic partitions and sufficient single-rule decision-making.
Details
Motivation: Rule-based models are crucial for high-stakes decision-making due to transparency, but their discrete nature poses optimization and scalability challenges that need to be addressed.Method: Uses Fuzzy Rule-based Reasoner (FRR) with semantically meaningful fuzzy logic partitions and sufficient (single-rule) decision-making to avoid combinatorial complexity of additive rule ensembles.
Result: Superior performance over traditional rule-based methods (e.g., 5% higher average accuracy than RIPPER), accuracy comparable to tree-based models with rule bases 90% more compact, and 96% of the accuracy of state-of-the-art additive rule models while using only 3% of their rule base size.
Conclusion: FRR successfully bridges the gap between interpretability and performance in rule-based models by enabling gradient-based optimization while maintaining strict complexity constraints and semantic clarity.
Abstract: Rule-based models are essential for high-stakes decision-making due to their transparency and interpretability, but their discrete nature creates challenges for optimization and scalability. In this work, we present the Fuzzy Rule-based Reasoner (FRR), a novel gradient-based rule learning system that supports strict user constraints over rule-based complexity while achieving competitive performance. To maximize interpretability, the FRR uses semantically meaningful fuzzy logic partitions, unattainable with existing neuro-fuzzy approaches, and sufficient (single-rule) decision-making, which avoids the combinatorial complexity of additive rule ensembles. Through extensive evaluation across 40 datasets, FRR demonstrates: (1) superior performance to traditional rule-based methods (e.g., $5\%$ average accuracy over RIPPER); (2) comparable accuracy to tree-based models (e.g., CART) using rule bases $90\%$ more compact; and (3) achieves $96\%$ of the accuracy of state-of-the-art additive rule-based models while using only sufficient rules and requiring only $3\%$ of their rule base size.
[440] Anomaly Detection in Complex Dynamical Systems: A Systematic Framework Using Embedding Theory and Physics-Inspired Consistency
Michael Somma, Thomas Gallien, Branka Stojanovic
Main category: cs.LG
TL;DR: A system-theoretic anomaly detection method using physics-inspired consistency principles and a temporal differential consistency autoencoder for oscillatory systems, matching LSTM performance with a nearly 100x reduction in MAC operations.
Details
Motivation: Anomaly detection is crucial for reliability and safety in industrial systems, especially for systems with oscillatory behaviors that require methods capturing structured temporal dependencies while maintaining physical consistency.Method: Proposes Temporal Differential Consistency Autoencoder (TDC-AE) based on Fractal Whitney Embedding Prevalence Theorem, using state-derivative pairs as embedding strategy and TDC-Loss to align latent variable derivatives with dynamic representations.
Result: Evaluated on C-MAPSS turbofan engine dataset, TDC-AE matches LSTM performance, outperforms Transformers, and achieves nearly 100x reduction in MAC operations, making it suitable for edge computing.
Conclusion: Anomalies disrupt stable system dynamics, and the proposed physics-inspired consistency approach provides robust anomaly detection signals while being computationally efficient.
Abstract: Anomaly detection in complex dynamical systems is essential for ensuring reliability, safety, and efficiency in industrial and cyber-physical infrastructures. Predictive maintenance helps prevent costly failures, while cybersecurity monitoring has become critical as digitized systems face growing threats. Many of these systems exhibit oscillatory behaviors and bounded motion, requiring anomaly detection methods that capture structured temporal dependencies while adhering to physical consistency principles. In this work, we propose a system-theoretic approach to anomaly detection, grounded in classical embedding theory and physics-inspired consistency principles. We build upon the Fractal Whitney Embedding Prevalence Theorem that extends traditional embedding techniques to complex system dynamics. Additionally, we introduce state-derivative pairs as an embedding strategy to capture system evolution. To enforce temporal coherence, we develop a Temporal Differential Consistency Autoencoder (TDC-AE), incorporating a TDC-Loss that aligns the approximated derivatives of latent variables with their dynamic representations. We evaluate our method on two subsets (FD001, FD003) of the C-MAPSS dataset, a benchmark for turbofan engine degradation. TDC-AE matches LSTMs and outperforms Transformers while achieving a nearly 100x reduction in MAC operations, making it particularly suited for lightweight edge computing. Our findings support the hypothesis that anomalies disrupt stable system dynamics, providing a robust signal for anomaly detection.
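The TDC-Loss idea can be sketched as a consistency penalty tying a finite-difference estimate of the latent derivative to the network's learned dynamic representation; the tensor names and pairing below are our illustrative assumptions.

```python
import torch

def tdc_loss(z_t, z_next, z_dot_pred, dt=1.0):
    """Penalize disagreement between the finite-difference latent
    derivative and the model's predicted derivative representation."""
    z_dot_fd = (z_next - z_t) / dt  # finite-difference derivative estimate
    return torch.mean((z_dot_fd - z_dot_pred) ** 2)

z_t, z_next = torch.randn(32, 8), torch.randn(32, 8)
print(tdc_loss(z_t, z_next, torch.randn(32, 8), dt=0.1))
```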
[441] A Transformer Model for Predicting Chemical Products from Generic SMARTS Templates with Data Augmentation
Derin Ozer, Sylvain Lamprier, Thomas Cauchy, Nicolas Gutowski, Benoit Da Mota
Main category: cs.LG
TL;DR: This paper introduces Broad Reaction Set (BRS) with 20 generic SMARTS templates and ProPreT5, a T5-based model that can directly handle SMARTS templates, along with a novel SMARTS augmentation strategy for improved generalization in chemical reaction prediction.
Details
Motivation: Current chemical reaction prediction models rely on either highly specific reaction templates or template-free methods, both of which have limitations that need to be addressed.Method: Proposed Broad Reaction Set (BRS) with 20 generic SMARTS templates, developed ProPreT5 (T5-based model adapted for chemistry), and introduced the first SMARTS augmentation strategy for structural diversity at pattern level.
Result: ProPreT5 trained on augmented templates demonstrates strong predictive performance and generalization to unseen reactions.
Conclusion: The contributions provide a novel and practical alternative to current methods, advancing template-based reaction prediction in computational chemistry.
Abstract: The accurate prediction of chemical reaction outcomes is a major challenge in computational chemistry. Current models rely heavily on either highly specific reaction templates or template-free methods, both of which present limitations. To address these, this work proposes the Broad Reaction Set (BRS), a set featuring 20 generic reaction templates written in SMARTS, a pattern-based notation designed to describe substructures and reactivity. Additionally, we introduce ProPreT5, a T5-based model specifically adapted for chemistry and, to the best of our knowledge, the first language model capable of directly handling and applying SMARTS reaction templates. To further improve generalization, we propose the first augmentation strategy for SMARTS, which injects structural diversity at the pattern level. Trained on augmented templates, ProPreT5 demonstrates strong predictive performance and generalization to unseen reactions. Together, these contributions provide a novel and practical alternative to current methods, advancing the field of template-based reaction prediction.
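For readers unfamiliar with SMARTS reaction templates, applying one with RDKit looks roughly like this; the ketone-reduction pattern below is our own illustration, not one of the 20 BRS templates.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Generic SMARTS reaction template: reduce a ketone C=O to an alcohol.
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])>>[C:1][O:2]")

mol = Chem.MolFromSmiles("CC(=O)c1ccccc1")  # acetophenone
for (product,) in rxn.RunReactants((mol,)):
    Chem.SanitizeMol(product)  # fix valences/implicit Hs on the raw product
    print(Chem.MolToSmiles(product))
```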
[442] DP-LET: An Efficient Spatio-Temporal Network Traffic Prediction Framework
Xintong Wang, Haihan Nan, Ruidong Li, Huaming Wu
Main category: cs.LG
TL;DR: DP-LET is an efficient spatio-temporal network traffic prediction framework that combines data processing, local feature enhancement with TCNs, and Transformer-based prediction to achieve state-of-the-art performance with low computational complexity.
Details
Motivation: Improving prediction accuracy and computational efficiency in spatio-temporal traffic prediction is essential for dynamic resource management and energy conservation in communication systems, as existing methods often incur heavy overhead when capturing local and global feature correlations.Method: The framework consists of three modules: 1) data processing module for denoising and spatial decoupling, 2) local feature enhancement module using multiple Temporal Convolutional Networks (TCNs) to capture fine-grained local features, and 3) Transformer-based prediction module to model long-term dependencies and feature relevance.
Result: DP-LET achieves state-of-the-art performance on real-world cellular traffic prediction, significantly reducing MSE by 31.8% and MAE by 23.1% compared to baseline models while maintaining low computational complexity.
Conclusion: The proposed DP-LET framework effectively balances prediction accuracy and computational efficiency, demonstrating practical utility for network traffic prediction with its modular design combining data processing, local feature enhancement, and Transformer-based prediction.
Abstract: Accurately predicting spatio-temporal network traffic is essential for dynamically managing computing resources in modern communication systems and minimizing energy consumption. Although spatio-temporal traffic prediction has received extensive research attention, further improvements in prediction accuracy and computational efficiency remain necessary. In particular, existing decomposition-based methods or hybrid architectures often incur heavy overhead when capturing local and global feature correlations, necessitating novel approaches that optimize accuracy and complexity. In this paper, we propose an efficient spatio-temporal network traffic prediction framework, DP-LET, which consists of a data processing module, a local feature enhancement module, and a Transformer-based prediction module. The data processing module is designed for high-efficiency denoising of network data and spatial decoupling. In contrast, the local feature enhancement module leverages multiple Temporal Convolutional Networks (TCNs) to capture fine-grained local features. Meanwhile, the prediction module utilizes a Transformer encoder to model long-term dependencies and assess feature relevance. A case study on real-world cellular traffic prediction demonstrates the practicality of DP-LET, which maintains low computational complexity while achieving state-of-the-art performance, significantly reducing MSE by 31.8% and MAE by 23.1% compared to baseline models.
[443] LEMUR Neural Network Dataset: Towards Seamless AutoML
Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Main category: cs.LG
TL;DR: LEMUR is an open-source dataset and framework providing standardized PyTorch neural networks across multiple tasks with unified templates, automated hyperparameter optimization, and tools for analysis and visualization.
Details
Motivation: Designing, evaluating, and comparing neural networks is labor-intensive with few standardized model collections, creating barriers to large-scale experimentation and fair benchmarking.Method: Creates a large collection of PyTorch-based neural networks following unified templates, integrates automated hyperparameter optimization via Optuna, provides statistical analysis and visualization tools, and offers an API for performance data access.
Result: LEMUR standardizes implementations and unifies evaluation, providing a structured database with configurations and results to ensure consistency and reproducibility.
Conclusion: LEMUR accelerates AutoML research, enables fair benchmarking, and reduces barriers to large-scale neural network experimentation through its extensible framework released under MIT license.
Abstract: Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr
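The Optuna integration amounts to the library's standard study/objective loop; the search space and toy objective below are placeholders standing in for LEMUR's model templates.

```python
import optuna

def objective(trial):
    # Placeholder search space; LEMUR wires this to its model templates.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    hidden = trial.suggest_int("hidden", 32, 512)
    # Toy surrogate for a validation loss so the sketch runs stand-alone.
    return (lr - 1e-3) ** 2 + ((hidden - 128) / 100.0) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```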
[444] Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin
Main category: cs.LG
TL;DR: SGPO addresses GRPO’s limitation in handling all-negative-sample groups by incorporating response diversity and step-wise judgment, improving RL training for LLMs across various model sizes and benchmarks.
Details
Motivation: GRPO fails to update policies when all responses in a group are incorrect, missing learning opportunities from mistakes that humans naturally utilize. This gap between artificial and human intelligence motivates the need for a better approach.Method: Introduces stepwise guided policy optimization (SGPO) which incorporates response diversity within groups using a step-wise judge model that can be trained or adapted from existing LLMs. This diversification accelerates learning dynamics.
Result: SGPO demonstrates consistent gains across model sizes (7B, 14B, 32B) in both offline and online training on 9 benchmarks. It outperforms GRPO especially in early and mid-training stages where all-negative-sample groups are prevalent.
Conclusion: SGPO effectively mitigates GRPO’s all-negative-sample limitation without requiring judge models to generate correct answers, differentiating it from knowledge distillation methods and providing more robust RL training for LLMs.
Abstract: Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training DeepSeek-R1. However, GRPO fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation underscores a key gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these signals. Our first contribution is to introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be either directly trained or adapted from existing LLMs. We prove that this diversification can accelerate GRPO's learning dynamics in a simplified setting. We also empirically validate the proposed stepwise guided policy optimization (SGPO) method, demonstrating consistent gains across model sizes (7B, 14B, 32B) in offline and online training on 9 benchmarks, including base and distilled variants. Our results highlight two advantages: (i) SGPO surpasses GRPO, especially in the early and mid-training stages where all-negative-sample groups are prevalent; and (ii) SGPO does not require judge models to generate correct answers, differentiating it from knowledge distillation methods.
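The all-negative-sample failure mode is easy to see in the group-relative advantage computation; the sketch below shows it, together with a hypothetical step-score remedy in the spirit of SGPO (the judge scores are simulated, not the paper's model).

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: reward minus group mean, divided by
    group std. If all responses are wrong (identical rewards), every
    advantage is zero and the policy update is a no-op."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

all_negative = torch.zeros(8)               # every sampled response incorrect
print(grpo_advantages(all_negative))        # all zeros: no learning signal

step_scores = 0.1 * torch.rand(8)           # simulated step-wise judge scores
print(grpo_advantages(all_negative + step_scores))  # diversified, non-zero
```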
[445] Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning
Babak Barazandeh, Subhabrata Majumdar, Om Rajyaguru, George Michailidis
Main category: cs.LG
TL;DR: Localized LoRA is a parameter-efficient fine-tuning method that applies low-rank updates to structured blocks of weight matrices, enabling dense localized updates without increasing trainable parameters.
Details
Motivation: Existing PEFT methods like LoRA rely on global low-rank structures that overlook spatial patterns across the parameter space, limiting their expressiveness and adaptability.Method: The proposed Localized LoRA models weight updates as a composition of low-rank matrices applied to structured blocks of the weight matrix, allowing for localized updates throughout the parameter space while maintaining the same parameter budget.
Result: The method achieves lower approximation error under matched parameter budgets compared to global and diagonal-local low-rank approximations, and demonstrates improved performance in both synthetic and practical settings.
Conclusion: Localized LoRA provides a more expressive and adaptable alternative to existing PEFT methods, enabling efficient fine-tuning with enhanced performance through localized parameter updates.
Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, offer compact and effective alternatives to full model fine-tuning by introducing low-rank updates to pre-trained weights. However, most existing approaches rely on global low rank structures, which can overlook spatial patterns spread across the parameter space. In this work, we propose Localized LoRA, a generalized framework that models weight updates as a composition of low-rank matrices applied to structured blocks of the weight matrix. This formulation enables dense, localized updates throughout the parameter space without increasing the total number of trainable parameters. We provide a formal comparison between global, diagonal-local, and fully localized low-rank approximations, and show that our method consistently achieves lower approximation error under matched parameter budgets. Experiments on both synthetic and practical settings demonstrate that Localized LoRA offers a more expressive and adaptable alternative to existing methods, enabling efficient fine-tuning with improved performance.
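A rough sketch of the block-wise formulation: each block of the weight matrix receives its own low-rank update, so the total update is dense while the trainable-parameter count stays fixed. The block grid, rank, and initialization here are illustrative assumptions.

```python
import torch

def localized_lora_delta(m, n, rank=2, blocks=2):
    """Compose a dense update from per-block low-rank factors A_ij @ B_ij,
    instead of one global low-rank product for the whole matrix."""
    bm, bn = m // blocks, n // blocks
    delta = torch.zeros(m, n)
    for i in range(blocks):
        for j in range(blocks):
            A = 0.01 * torch.randn(bm, rank)  # trainable per-block factors
            B = 0.01 * torch.randn(rank, bn)
            delta[i * bm:(i + 1) * bm, j * bn:(j + 1) * bn] = A @ B
    return delta

W = torch.randn(8, 8)
W_adapted = W + localized_lora_delta(8, 8)   # dense, localized update
print(W_adapted.shape)
```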
[446] Urania: Differentially Private Insights into AI Use
Daogao Liu, Edith Cohen, Badih Ghazi, Peter Kairouz, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Adam Sealfon, Da Yu, Chiyuan Zhang
Main category: cs.LG
TL;DR: Urania is a framework for generating insights from LLM chatbot interactions with differential privacy guarantees using private clustering and keyword extraction methods.
Details
Motivation: To provide rigorous privacy protection while extracting meaningful conversational insights from LLM chatbot interactions, balancing data utility with privacy preservation.Method: Uses private clustering mechanism and three keyword extraction approaches (frequency-based, TF-IDF-based, LLM-guided) with DP tools including clustering, partition selection, and histogram-based summarization.
Result: The framework effectively preserves lexical and semantic content while maintaining privacy, showing enhanced robustness compared to non-private approaches.
Conclusion: Urania successfully balances meaningful insight extraction with stringent user privacy protection through its differential privacy pipeline.
Abstract: We introduce $Urania$, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, $Urania$ provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework’s ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.
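Among the DP tools mentioned, histogram-based summarization is the simplest to illustrate: add Laplace noise calibrated to sensitivity/epsilon before release. This is a generic sketch of the Laplace mechanism, not Urania's actual mechanism or parameters.

```python
import numpy as np

def dp_histogram(counts, epsilon=1.0, sensitivity=1.0):
    """Release a histogram via the Laplace mechanism: per-bin noise with
    scale sensitivity/epsilon gives epsilon-DP when one user can change
    a single bin by at most `sensitivity`."""
    rng = np.random.default_rng()
    noisy = counts + rng.laplace(scale=sensitivity / epsilon, size=len(counts))
    return np.maximum(noisy, 0.0)  # clamp impossible negative counts

keyword_counts = np.array([120.0, 45.0, 9.0, 3.0])
print(dp_histogram(keyword_counts, epsilon=0.5))
```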
[447] CellCLIP – Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning
Mingyu Lu, Ethan Weinberger, Chanwoo Kim, Su-In Lee
Main category: cs.LG
TL;DR: CellCLIP is a cross-modal contrastive learning framework for High-content screening (HCS) data that aligns perturbations with their morphological effects using pre-trained image encoders with channel encoding and natural language encoders.
Details
Motivation: To address challenges in applying cross-modal contrastive learning to HCS data due to semantic differences between Cell Painting images and natural images, and difficulties in representing diverse perturbation types in a unified latent space.Method: Leverages pre-trained image encoders with novel channel encoding scheme to capture microscopy channel relationships, combined with natural language encoders for perturbation representation in a cross-modal contrastive learning framework.
Result: Outperforms current open-source models in cross-modal retrieval and biologically meaningful downstream tasks while achieving significant computation time reductions.
Conclusion: CellCLIP successfully addresses HCS data challenges and demonstrates superior performance in aligning perturbations with morphological effects through effective cross-modal representation learning.
Abstract: High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting have enabled the interrogation of cells’ morphological responses to perturbations at an unprecedented scale. The collection of such data promises to facilitate a better understanding of the relationships between different perturbations and their effects on cellular state. Towards achieving this goal, recent advances in cross-modal contrastive learning could, in theory, be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, the application of such methods to HCS data is not straightforward due to substantial differences in the semantics of Cell Painting images compared to natural images, and the difficulty of representing different classes of perturbations (e.g., small molecule vs CRISPR gene knockout) in a single latent space. In response to these challenges, here we introduce CellCLIP, a cross-modal contrastive learning framework for HCS data. CellCLIP leverages pre-trained image encoders coupled with a novel channel encoding scheme to better capture relationships between different microscopy channels in image embeddings, along with natural language encoders for representing perturbations. Our framework outperforms current open-source models, demonstrating the best performance in both cross-modal retrieval and biologically meaningful downstream tasks while also achieving significant reductions in computation time.
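The cross-modal objective is presumably a CLIP-style symmetric InfoNCE loss over matched (image, perturbation) pairs; the sketch below shows that standard form, with the channel encoding scheme and the encoders abstracted away.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image/perturbation embeddings are
    pulled together, all mismatched pairs in the batch pushed apart."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

print(clip_style_loss(torch.randn(32, 128), torch.randn(32, 128)))
```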
[448] Why Do Some Inputs Break Low-Bit LLM Quantization?
Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
Main category: cs.LG
TL;DR: Analysis of low-bit weight quantization in LLMs reveals strong correlation between quantization errors across methods, identifies residual stream magnitudes as predictive of future errors, and shows that late-layer activations and MLP gates are critical for maintaining performance.
Details
Motivation: Low-bit weight-only quantization reduces LLM memory footprint but disproportionately affects certain examples. Understanding why specific examples suffer large quantization errors and which model components are most affected is crucial for improving quantization techniques.Method: Analyzed diverse 3-4 bit quantization methods on LLMs (7B-70B), examined error correlations across 50 method pairs, studied residual stream magnitudes, used LLM localization techniques, early exiting, and activation patching to identify critical components.
Result: Quantization errors of different methods are strongly correlated (avg. 0.82), residual stream magnitudes predict future quantization errors, and late-layer activations and MLP gates are crucial for maintaining perplexity.
Conclusion: The work reveals why certain examples result in large quantization errors and identifies the model components most critical for performance preservation, providing insights for developing more robust quantization methods.
Abstract: Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but disproportionately affects certain examples. We analyze diverse 3-4 bit methods on LLMs ranging from 7B-70B in size and find that the quantization errors of 50 pairs of methods are strongly correlated (avg. 0.82) on FineWeb examples. Moreover, the residual stream magnitudes of full-precision models are indicative of future quantization errors. We further establish a hypothesis that relates the residual stream magnitudes to error amplification and accumulation over layers. Using LLM localization techniques, early exiting, and activation patching, we show that examples with large errors rely on precise residual activations in the late layers, and that the outputs of MLP gates play a crucial role in maintaining the perplexity. Our work reveals why certain examples result in large quantization errors and which model components are most critical for performance preservation.
[449] Quantum-Classical Hybrid Quantized Neural Network
Wenxin Li, Chuan Wang, Hongdong Zhu, Qi Gao, Yin Ma, Hai Wei, Kai Wen
Main category: cs.LG
TL;DR: A novel Quadratic Binary Optimization model for quantized neural network training using spline interpolation, with Forward Interval Propagation to handle nonlinearities and Quantum Conditional Gradient Descent for constraint handling.
Details
Motivation: To enable quantum computers to optimize complex nonlinear neural networks by addressing challenges of non-linearity, multi-layer structure, and constraint handling in large-scale optimization problems.Method: Forward Interval Propagation discretizes activation functions into linear subintervals, while Quantum Conditional Gradient Descent directly solves the Quadratic Constrained Binary Optimization problem with theoretical convergence guarantees.
Result: Theoretical upper bounds on approximation error and Ising spins required, convergence proofs for QCGD under quantum oracle with randomness, and Time-To-Solution bounds for QCBO solving.
Conclusion: The framework broadens quantum computing applicability in AI by preserving neural networks’ universal approximation properties while enabling optimization of complex nonlinear functions on quantum hardware.
Abstract: Here in this work, we present a novel Quadratic Binary Optimization (QBO) model for quantized neural network training, enabling the use of arbitrary activation and loss functions through spline interpolation. We introduce Forward Interval Propagation (FIP), a method designed to tackle the challenges of non-linearity and the multi-layer composite structure in neural networks by discretizing activation functions into linear subintervals. This approach preserves the universal approximation properties of neural networks while allowing complex nonlinear functions to be optimized using quantum computers, thus broadening their applicability in artificial intelligence. We provide theoretical upper bounds on the approximation error and the number of Ising spins required, by deriving the sample complexity of the empirical risk minimization problem, from an optimization perspective. A significant challenge in solving the associated Quadratic Constrained Binary Optimization (QCBO) model on a large scale is the presence of numerous constraints. When employing the penalty method to handle these constraints, tuning a large number of penalty coefficients becomes a critical hyperparameter optimization problem, increasing computational complexity and potentially affecting solution quality. To address this, we employ the Quantum Conditional Gradient Descent (QCGD) algorithm, which leverages quantum computing to directly solve the QCBO problem. We prove the convergence of QCGD under a quantum oracle with randomness and bounded variance in objective value, as well as under limited precision constraints in the coefficient matrix. Additionally, we provide an upper bound on the Time-To-Solution for the QCBO solving process. We further propose a training algorithm with single-sample bit-scale optimization.
[450] Beyond Grids: Multi-objective Bayesian Optimization With Adaptive Discretization
Andi Nika, Sepehr Elahi, Çağın Ararat, Cem Tekin
Main category: cs.LG
TL;DR: Adaptive ε-PAL algorithm for efficient Pareto set identification in vector-valued Gaussian Process optimization with large design spaces
Details
Motivation: Exhaustive search for Pareto optimal designs is infeasible when design space cardinality is large, requiring efficient algorithms that exploit function smoothness and space structure.Method: Tree-based adaptive discretization technique using Gaussian Process sampling to identify ε-accurate Pareto sets with minimal evaluations
Result: Provides information-type and metric dimension-type bounds on sample complexity, and experimentally outperforms other Pareto set identification methods
Conclusion: Adaptive ε-PAL effectively learns Pareto optimal designs in large spaces by leveraging GP smoothness and adaptive discretization
Abstract: We consider the problem of optimizing a vector-valued objective function $\boldsymbol{f}$ sampled from a Gaussian Process (GP) whose index set is a well-behaved, compact metric space $({\cal X},d)$ of designs. We assume that $\boldsymbol{f}$ is not known beforehand and that evaluating $\boldsymbol{f}$ at design $x$ results in a noisy observation of $\boldsymbol{f}(x)$. Since identifying the Pareto optimal designs via exhaustive search is infeasible when the cardinality of ${\cal X}$ is large, we propose an algorithm, called Adaptive $\boldsymbol{\epsilon}$-PAL, that exploits the smoothness of the GP-sampled function and the structure of $({\cal X},d)$ to learn fast. In essence, Adaptive $\boldsymbol{\epsilon}$-PAL employs a tree-based adaptive discretization technique to identify an $\boldsymbol{\epsilon}$-accurate Pareto set of designs in as few evaluations as possible. We provide both information-type and metric dimension-type bounds on the sample complexity of $\boldsymbol{\epsilon}$-accurate Pareto set identification. We also experimentally show that our algorithm outperforms other Pareto set identification methods.
[451] Sample what you can't compress
Vighnesh Birodkar, Gabriel Barcik, James Lyon, Sergey Ioffe, David Minnen, Joshua V. Dillon
Main category: cs.LG
TL;DR: SWYCC combines autoencoder representation learning with diffusion models to improve reconstruction quality and generation compared to GAN-based autoencoders, using a stochastic decoder that can generate details not encoded in the deterministic latent representation.
Details
Motivation: Basic autoencoders produce blurry results, and while GAN/perceptual losses improve quality, they lack principled interpretation. Diffusion models have solid theoretical foundations and produce crisp results, suggesting a better approach for autoencoder learning.Method: Jointly learns a continuous encoder and decoder under a diffusion-based loss function, creating a stochastic decoder that can generate details beyond what’s encoded in the deterministic latent representation.
Result: Better reconstruction quality than GAN-based autoencoders, easier tuning, higher compression, better generation, and representations that are easier to model with latent diffusion models.
Conclusion: The SWYCC approach successfully combines autoencoder representation learning with diffusion, demonstrating superior performance and theoretical grounding compared to existing methods.
Abstract: For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate jointly learning a continuous encoder and decoder under a diffusion-based loss and showing that it can lead to higher compression and better generation. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach “Sample what you can’t compress”, or SWYCC for short.
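A minimal sketch of the recipe, under heavy simplifying assumptions (linear layers, a toy linear noising schedule rather than a real diffusion process): a deterministic encoder produces the latent, and a denoiser conditioned on that latent is trained with a diffusion-style noise-prediction loss.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(784, 32)                  # deterministic latent
denoiser = nn.Linear(784 + 32 + 1, 784)       # predicts noise from (x_t, z, t)

def swycc_style_loss(x):
    z = encoder(x)
    t = torch.rand(x.size(0), 1)              # random diffusion time
    noise = torch.randn_like(x)
    x_t = (1.0 - t) * x + t * noise           # toy linear noising schedule
    pred = denoiser(torch.cat([x_t, z, t], dim=-1))
    return ((pred - noise) ** 2).mean()       # stochastic decoder's loss

print(swycc_style_loss(torch.randn(8, 784)))
```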
[452] Survey of Deep Learning and Physics-Based Approaches in Computational Wave Imaging
Youzuo Lin, Shihang Feng, James Theiler, Yinpeng Chen, Umberto Villa, Jing Rao, John Greenhall, Cristian Pantea, Mark A. Anastasio, Brendt Wohlberg
Main category: cs.LG
TL;DR: This paper reviews the integration of deep learning with traditional physics-based methods for computational wave imaging (CWI) problems across multiple domains.
Details
Motivation: CWI applications face challenges with traditional physics-based methods being computationally intensive and susceptible to ill-posedness, while machine learning offers new perspectives to address these limitations.Method: The paper presents a structured framework consolidating research from computational imaging, wave physics, and data science, analyzing how deep neural networks enhance and integrate with physics-based CWI methods.
Result: The review systematically analyzes extensive literature to identify important lessons from existing ML-based methods for CWI.
Conclusion: The study identifies technical hurdles and emerging trends in ML-based computational wave imaging through systematic analysis of current research.
Abstract: Computational wave imaging (CWI) extracts hidden structure and physical properties of a volume of material by analyzing wave signals that traverse that volume. Applications include seismic exploration of the Earth’s subsurface, acoustic imaging and non-destructive testing in material science, and ultrasound computed tomography in medicine. Current approaches for solving CWI problems can be divided into two categories: those rooted in traditional physics, and those based on deep learning. Physics-based methods stand out for their ability to provide high-resolution and quantitatively accurate estimates of acoustic properties within the medium. However, they can be computationally intensive and are susceptible to ill-posedness and nonconvexity typical of CWI problems. Machine learning-based computational methods have recently emerged, offering a different perspective to address these challenges. Diverse scientific communities have independently pursued the integration of deep learning in CWI. This review discusses how contemporary scientific machine-learning (ML) techniques, and deep neural networks in particular, have been developed to enhance and integrate with traditional physics-based methods for solving CWI problems. We present a structured framework that consolidates existing research spanning multiple domains, including computational imaging, wave physics, and data science. This study concludes with important lessons learned from existing ML-based methods and identifies technical hurdles and emerging trends through a systematic analysis of the extensive literature on this topic.
[453] Beyond Simple Graphs: Neural Multi-Objective Routing on Multigraphs
Filip Rydin, Attila Lischka, Jiaming Wu, Morteza Haghir Chehreghani, Balázs Kulcsár
Main category: cs.LG
TL;DR: Two GNN-based methods for multi-objective routing on multigraphs: one operates directly on multigraphs via autoregressive edge selection, and another uses learned pruning to simplify multigraphs before routing.
Details
Motivation: Existing routing methods are unsuitable for multigraphs despite their real-world relevance, creating a gap in learning-based routing approaches for scenarios with multiple edges between node pairs.Method: Proposed two approaches: 1) Direct multigraph routing using autoregressive edge selection, 2) More scalable method that first prunes multigraphs via learned strategy then performs routing on simplified graph.
Result: Both models demonstrate competitive performance across various problems and graph distributions compared to strong heuristics and neural baselines.
Conclusion: The proposed methods effectively address multi-objective routing on multigraphs, with the pruning-based approach offering better scalability while maintaining performance.
Abstract: Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. Yet, existing methods are unsuitable for routing on multigraphs, which feature multiple edges with distinct attributes between node pairs, despite their strong relevance in real-world scenarios. In this paper, we propose two graph neural network-based methods to address multi-objective routing on multigraphs. Our first approach operates directly on the multigraph by autoregressively selecting edges until a tour is completed. The second model, which is more scalable, first simplifies the multigraph via a learned pruning strategy and then performs autoregressive routing on the resulting simple graph. We evaluate both models empirically, across a wide range of problems and graph distributions, and demonstrate their competitive performance compared to strong heuristics and neural baselines.
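The pruning idea can be illustrated with a simple dominance rule on parallel edges: an edge that is at least as bad as a sibling in every objective can never enter a Pareto-optimal tour. The paper learns its pruning strategy; the hand-written rule and objective names below are stand-ins.

```python
def prune_parallel_edges(edges):
    """Keep only non-dominated parallel edges between one node pair.
    An edge is dropped if some sibling is no worse in every objective."""
    return [e for e in edges
            if not any(o != e and all(o[k] <= e[k] for k in e) for o in edges)]

parallel = [{"cost": 3, "time": 5}, {"cost": 4, "time": 6}, {"cost": 2, "time": 9}]
print(prune_parallel_edges(parallel))  # drops the dominated {cost: 4, time: 6}
```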
[454] Model-Agnostic AI Framework with Explicit Time Integration for Long-Term Fluid Dynamics Prediction
Sunwoong Yang, Ricardo Vinuesa, Namwoo Kang
Main category: cs.LG
TL;DR: This paper introduces a novel framework combining the two-step Adams-Bashforth method with adaptive multi-step rollout strategies to address error accumulation in spatio-temporal auto-regressive predictions, achieving significant improvements in numerical stability and prediction accuracy for complex PDE systems.
Details
Motivation: The study addresses the critical challenge of error accumulation in spatio-temporal auto-regressive predictions within scientific machine learning models, particularly for complex physical systems where conventional methods suffer from instability and poor long-term prediction accuracy.Method: The authors implement the first data-driven two-step Adams-Bashforth method for AR prediction, leveraging historical derivative information. They also develop three novel adaptive weighting strategies that dynamically adjust the importance of different future time steps during multi-step rollout training. The approach is validated on canonical 2D PDEs and complex Navier-Stokes cylinder vortex shedding dynamics.
Result: The framework achieves an 89% improvement over conventional fixed-weight methods while maintaining similar computational costs. For the Navier-Stokes vortex shedding problem, it reduces mean squared error from 0.125 to 0.002 using only 1,177 trainable parameters and 50 training snapshots. The method shows 83% improvement over standard noise injection and maintains robustness under severe spatial constraints.
Conclusion: The integrated methodology demonstrates that sophisticated rollout techniques become essential as physical complexity increases, with the Adams-Bashforth scheme showing consistent robustness across systems. The approach maintains effectiveness even with limited training data and partial spatial domains, making it particularly valuable for complex scientific machine learning applications.
Abstract: This study addresses the critical challenge of error accumulation in spatio-temporal auto-regressive (AR) predictions within scientific machine learning models by exploring temporal integration schemes and adaptive multi-step rollout strategies. We introduce the first implementation of the two-step Adams-Bashforth method specifically tailored for data-driven AR prediction, leveraging historical derivative information to enhance numerical stability without additional computational overhead. To validate our approach, we systematically evaluate time integration schemes across canonical 2D PDEs before extending to complex Navier-Stokes cylinder vortex shedding dynamics. Additionally, we develop three novel adaptive weighting strategies that dynamically adjust the importance of different future time steps during multi-step rollout training. Our analysis reveals that as physical complexity increases, such sophisticated rollout techniques become essential, with the Adams-Bashforth scheme demonstrating consistent robustness across investigated systems and our best adaptive approach delivering an 89% improvement over conventional fixed-weight methods while maintaining similar computational costs. For the complex Navier-Stokes vortex shedding problem, despite using an extremely lightweight graph neural network with just 1,177 trainable parameters and training on only 50 snapshots, our framework accurately predicts 350 future time steps reducing mean squared error from 0.125 (single-step direct prediction) to 0.002 (Adams-Bashforth with proposed multi-step rollout). Our integrated methodology demonstrates an 83% improvement over standard noise injection techniques and maintains robustness under severe spatial constraints; specifically, when trained on only a partial spatial domain, it still achieves 58% and 27% improvements over direct prediction and forward Euler methods, respectively.
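The two-step Adams-Bashforth update is cheap to state: u_{n+1} = u_n + dt * (3/2 f(u_n) - 1/2 f(u_{n-1})), reusing the previous step's derivative at no extra cost. Below is a stand-alone sketch in which the learned network is replaced by a toy linear ODE so it runs as-is.

```python
import numpy as np

def ab2_rollout(f, u0, dt, steps):
    """Auto-regressive rollout with the two-step Adams-Bashforth scheme;
    the first step is bootstrapped with forward Euler."""
    f_prev = f(u0)
    u = u0 + dt * f_prev
    traj = [u0, u]
    for _ in range(steps - 1):
        f_cur = f(u)
        u, f_prev = u + dt * (1.5 * f_cur - 0.5 * f_prev), f_cur
        traj.append(u)
    return np.array(traj)

# du/dt = -u from u(0)=1: after 10 steps of dt=0.1, expect ~ exp(-1) = 0.368
print(ab2_rollout(lambda u: -u, np.array([1.0]), 0.1, 10)[-1])
```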
[455] Structure As Search: Unsupervised Permutation Learning for Combinatorial Optimization
Yimeng Min, Carla P. Gomes
Main category: cs.LG
TL;DR: A non-autoregressive neural approach for TSP that learns permutation matrices directly without search, achieving competitive performance with classical heuristics.
Details
Motivation: To demonstrate that neural networks can directly capture combinatorial structure without requiring sequential decision-making or explicit search procedures.Method: Apply similarity transformation to Hamiltonian cycles and learn to approximate permutation matrices via continuous relaxations in an unsupervised framework.
Result: Competitive performance against classical heuristics on the Travelling Salesman Problem.
Conclusion: Neural networks can directly capture and exploit combinatorial structure, offering evidence that non-autoregressive approaches are effective for combinatorial optimization.
Abstract: We propose a non-autoregressive framework for the Travelling Salesman Problem where solutions emerge directly from learned permutations, without requiring explicit search. By applying a similarity transformation to Hamiltonian cycles, the model learns to approximate permutation matrices via continuous relaxations. Our unsupervised approach achieves competitive performance against classical heuristics, demonstrating that the inherent structure of the problem can effectively guide combinatorial optimization without sequential decision-making. Our method offers concrete evidence that neural networks can directly capture and exploit combinatorial structure.
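One common continuous relaxation for learning permutation matrices is Sinkhorn normalization, which iteratively row- and column-normalizes a score matrix toward a doubly-stochastic one; whether the paper uses Sinkhorn specifically is not stated, so treat this as a generic sketch.

```python
import torch

def sinkhorn(log_alpha, n_iters=20):
    """Alternate row/column normalization in log space; the result is
    approximately doubly stochastic, i.e. a soft permutation matrix."""
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=0, keepdim=True)
    return log_alpha.exp()

P = sinkhorn(torch.randn(5, 5))
print(P.sum(dim=0))  # ~ ones: column sums
print(P.sum(dim=1))  # ~ ones: row sums
```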
[456] LLMs for Cold-Start Cutting Plane Separator Configuration
Connor Lawless, Yingxi Li, Anders Wikum, Madeleine Udell, Ellen Vitercik
Main category: cs.LG
TL;DR: LLM-based framework for configuring MILP solver parameters using problem descriptions and separator summaries, with ensembling to reduce variance and create high-performing portfolios.
Details
Motivation: MILP solvers have hundreds of parameters that significantly impact performance but are difficult to configure, especially for non-expert users. Existing ML approaches require extensive training data and don't integrate well with solver workflows.Method: Uses large language models to configure cutting plane separators based on problem descriptions and solver-specific separator summaries. Introduces ensembling strategy that clusters and aggregates candidate configurations into a small portfolio.
Result: The approach matches or outperforms state-of-the-art configuration methods with significantly less data and computation. Generates configurations in seconds via simple API calls without custom solver interfaces.
Conclusion: LLM-based configuration provides an efficient, accessible alternative to traditional parameter tuning methods, requiring minimal data and computational resources while maintaining high performance.
Abstract: Mixed integer linear programming (MILP) solvers expose hundreds of parameters that have an outsized impact on performance but are difficult to configure for all but expert users. Existing machine learning (ML) approaches require training on thousands of related instances, generalize poorly and can be difficult to integrate into existing solver workflows. We propose a large language model (LLM)-based framework that configures cutting plane separators using problem descriptions and solver-specific separator summaries. To reduce variance in LLM outputs, we introduce an ensembling strategy that clusters and aggregates candidate configurations into a small portfolio of high-performing configurations. Our method requires no custom solver interface, generates configurations in seconds via simple API calls, and requires solving only a small number of instances. Extensive experiments on standard synthetic and real-world MILPs show our approach matches or outperforms state-of-the-art configuration methods with a fraction of the data and computation.
[457] LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization
Xujia Wang, Yunjia Qi, Bin Xu
Main category: cs.LG
TL;DR: LoSiA is a parameter-efficient fine-tuning method that dynamically identifies and optimizes critical sub-networks using gradient sparsity analysis, reducing computational overhead while maintaining performance comparable to full fine-tuning.
Details
Motivation: Existing PEFT methods like LoRA perform extensive matrix multiplications in domain specialization tasks, leading to computational inefficiency and sub-optimal fine-tuning performance.Method: LoSiA identifies a sub-network using gradient sparsity analysis and optimizes only these critical parameters, enabling effective high-rank adaptation while reducing additional matrix multiplication. LoSiA-Pro is a faster implementation that reduces training latency by about 27% compared to LoRA.
Result: Extensive evaluations show minimal performance drop compared to full fine-tuning while requiring the least training time across domain specialization and common-sense reasoning tasks. LoSiA also reduces forgetting during continued training.
Conclusion: LoSiA provides an efficient alternative to existing PEFT methods by dynamically localizing and optimizing critical parameters, achieving better computational efficiency without sacrificing performance.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA(Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about $27\%$ compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training. The source code is available at https://github.com/KlozeWang/LoSiA.
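The gradient-sparsity selection step might look as follows: after a backward pass, retain only the highest-gradient-magnitude parameters as trainable. This is our simplified reading; LoSiA's actual localization and re-selection schedule are more involved.

```python
import torch
import torch.nn as nn

def gradient_masks(model, keep_ratio=0.1):
    """Build boolean masks keeping only the top-|grad| fraction of
    parameters; applying them to gradients freezes the rest."""
    scores = torch.cat([p.grad.abs().flatten()
                        for p in model.parameters() if p.grad is not None])
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return {name: p.grad.abs() >= threshold
            for name, p in model.named_parameters() if p.grad is not None}

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
model(torch.randn(4, 10)).sum().backward()
masks = gradient_masks(model)
print({n: int(m.sum()) for n, m in masks.items()})  # surviving params per tensor
```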
[458] Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum
Keisuke Kamo, Hideaki Iiduka
Main category: cs.LG
TL;DR: This paper shows that using an increasing batch size in mini-batch SGDM improves convergence to stationary points faster than constant batch size, while reducing computational cost.
Details
Motivation: While theoretical studies show learning rate and momentum affect SGDM convergence, practical studies indicate batch size strongly impacts performance. The authors aim to theoretically analyze how batch size affects mini-batch SGDM convergence.Method: Theoretical analysis of mini-batch SGDM with constant learning rate and momentum weight, comparing constant vs increasing batch sizes. Numerical experiments validate the theoretical findings.
Result: Theoretical proof shows constant batch size doesn’t always minimize full gradient norm expectation, while increasing batch size definitely minimizes it. Numerical results confirm faster convergence to stationary points with increasing batch size.
Conclusion: Increasing batch size in mini-batch SGDM improves convergence efficiency and reduces computational cost compared to constant batch size approach.
Abstract: Stochastic gradient descent with momentum (SGDM), in which a momentum term is added to SGD, has been well studied in both theory and practice. The theoretical studies show that the settings of the learning rate and momentum weight affect the convergence of SGDM. Meanwhile, the practical studies have shown that the batch-size setting strongly affects the performance of SGDM. In this paper, we focus on mini-batch SGDM with a constant learning rate and constant momentum weight, which is frequently used to train deep neural networks. We show theoretically that using a constant batch size does not always minimize the expectation of the full gradient norm of the empirical loss in training a deep neural network, whereas using an increasing batch size definitely minimizes it; that is, an increasing batch size improves the convergence of mini-batch SGDM. We also provide numerical results supporting our analyses, indicating specifically that mini-batch SGDM with an increasing batch size converges to stationary points faster than with a constant batch size, while also reducing computational cost. Python implementations of the optimizers used in the numerical experiments are available at https://github.com/iiduka-researches/NSHB_increasing_batchsize_acml25/.
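In practice, the result suggests schedules like the one sketched below, where the batch size grows geometrically during training; the specific constants are illustrative, not taken from the paper.

```python
def increasing_batch_sizes(b0=32, growth=2, interval=10, epochs=50):
    """Batch size doubles every `interval` epochs, so later epochs use
    larger (lower-variance) gradient estimates."""
    return [b0 * growth ** (epoch // interval) for epoch in range(epochs)]

print(increasing_batch_sizes()[::10])  # [32, 64, 128, 256, 512]
```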
[459] Bias-variance decompositions: the exclusive privilege of Bregman divergences
Tom Heskes
Main category: cs.LG
TL;DR: This paper proves that g-Bregman divergences are the only loss functions satisfying certain conditions that permit clean bias-variance decompositions, explaining why common metrics like 0-1 and L1 losses fail to have such decompositions.
Details
Motivation: Bias-variance decompositions are crucial for understanding model generalization, but existing decompositions are limited to specific loss functions like squared error. The paper aims to identify the necessary and sufficient conditions for clean decompositions across broader loss functions.Method: The authors study continuous, nonnegative loss functions satisfying identity of indiscernibles under mild regularity conditions. They prove that only g-Bregman divergences permit clean bias-variance decompositions, and show these can be transformed into standard Bregman divergences via invertible variable changes.
Result: The paper establishes that g-Bregman divergences are the exclusive class of loss functions with clean bias-variance decompositions. This explains why previous attempts with 0-1 and L1 losses failed. The squared Mahalanobis distance is identified as the only symmetric loss function (up to variable transformation) with this property.
Conclusion: The research provides fundamental theoretical limitations on bias-variance decompositions, showing they are only possible for specific mathematical structures (g-Bregman divergences), which has important implications for understanding model generalization with different loss functions.
Abstract: Bias-variance decompositions are widely used to understand the generalization performance of machine learning models. While the squared error loss permits a straightforward decomposition, other loss functions - such as zero-one loss or $L_1$ loss - either fail to sum bias and variance to the expected loss or rely on definitions that lack the essential properties of meaningful bias and variance. Recent research has shown that clean decompositions can be achieved for the broader class of Bregman divergences, with the cross-entropy loss as a special case. However, the necessary and sufficient conditions for these decompositions remain an open question. In this paper, we address this question by studying continuous, nonnegative loss functions that satisfy the identity of indiscernibles (zero loss if and only if the two arguments are identical), under mild regularity conditions. We prove that so-called $g$-Bregman divergences are the only such loss functions that have a clean bias-variance decomposition. A $g$-Bregman divergence can be transformed into a standard Bregman divergence through an invertible change of variables. This makes the squared Mahalanobis distance, up to such a variable transformation, the only symmetric loss function with a clean bias-variance decomposition. Consequently, common metrics such as $0$-$1$ and $L_1$ losses cannot admit a clean bias-variance decomposition, explaining why previous attempts have failed. We also examine the impact of relaxing the restrictions on the loss functions and how this affects our results.
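For reference, the clean decomposition in question, written in our notation following the generalized bias-variance decomposition for Bregman divergences (the paper's g-Bregman case reduces to this form after an invertible change of variables):

```latex
\[
  B_\phi(y,\hat y) = \phi(y) - \phi(\hat y)
    - \langle \nabla\phi(\hat y),\, y - \hat y \rangle,
\]
\[
  \mathbb{E}\bigl[B_\phi(y,\hat y)\bigr]
  = \underbrace{\mathbb{E}\bigl[B_\phi(y,\bar y)\bigr]}_{\text{noise}}
  + \underbrace{B_\phi(\bar y,\mathring y)}_{\text{bias}}
  + \underbrace{\mathbb{E}\bigl[B_\phi(\mathring y,\hat y)\bigr]}_{\text{variance}},
  \qquad
  \mathring y = (\nabla\phi)^{-1}\!\bigl(\mathbb{E}[\nabla\phi(\hat y)]\bigr).
\]
```

Here $\bar y = \mathbb{E}[y]$ is the mean label and $\mathring y$ the dual-mean ("central") prediction; for squared error this reduces to the familiar noise + bias² + variance split.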
[460] Enhanced uncertainty quantification variational autoencoders for the solution of Bayesian inverse problems
Andrea Tonini, Luca Dede’
Main category: cs.LG
TL;DR: Proposes a novel loss function for training variational autoencoders (VAEs) for Bayesian inverse problems, with theoretical convergence proof for affine forward maps and validation through numerical tests.
Details
Motivation: To improve Bayesian inverse problem solving using neural networks, specifically enhancing variational autoencoders for real-time inverse uncertainty quantification of model parameters.Method: Developed a novel loss function for training VAEs, provided theoretical convergence proof for latent states to posterior distribution when forward map is affine, and conducted numerical validation including comparison with existing VAEs and Markov Chain Monte Carlo.
Result: The proposed VAE with the novel loss function shows improved accuracy and generalization properties compared to existing methods, with theoretical convergence validated through numerical tests on a Laplace equation.
Conclusion: The novel loss function enhances VAE performance for Bayesian inverse problems, providing theoretical guarantees and practical improvements over existing approaches.
Abstract: Among other uses, neural networks are a powerful tool for solving deterministic and Bayesian inverse problems in real-time, where variational autoencoders, a specialized type of neural network, enable the Bayesian estimation of model parameters and their distribution from observational data allowing real-time inverse uncertainty quantification. In this work, we build upon existing research [Goh, H. et al., Proceedings of Machine Learning Research, 2022] by proposing a novel loss function to train variational autoencoders for Bayesian inverse problems. When the forward map is affine, we provide a theoretical proof of the convergence of the latent states of variational autoencoders to the posterior distribution of the model parameters. We validate this theoretical result through numerical tests and we compare the proposed variational autoencoder with the existing one in the literature both in terms of accuracy and generalization properties. Finally, we test the proposed variational autoencoder on a Laplace equation, with comparison to the original one and Markov chain Monte Carlo.
[461] Assay2Mol: large language model-based drug design using BioAssay context
Yifan Deng, Spencer S. Ericksen, Anthony Gitter
Main category: cs.LG
TL;DR: Assay2Mol is an LLM-based workflow that leverages unstructured text from biochemical screening assays to generate candidate molecules for drug discovery, outperforming structure-based ML approaches.
Details
Motivation: Scientific databases contain vast unstructured text describing biological mechanisms and experimental protocols in biochemical assays, which remains untapped for drug discovery despite its rich information.
Method: Assay2Mol retrieves existing assay records with similar targets and generates candidate molecules using in-context learning with the retrieved screening data.
Result: Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while promoting more synthesizable molecule generation.
Conclusion: The approach successfully capitalizes on unstructured text in biochemical assays for early-stage drug discovery, demonstrating superior performance over structure-based methods.
Abstract: Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate candidate molecules’ functional responses against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offers rich information for drug discovery campaigns but has remained untapped because of its unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.
[462] Diffusion Classifier-Driven Reward for Offline Preference-based Reinforcement Learning
Teng Pang, Bingzheng Wang, Guoqiang Wu, Yilong Yin
Main category: cs.LG
TL;DR: DPR and C-DPR are novel diffusion-based methods for offline preference-based RL that improve step-wise reward inference from trajectory-wise preferences, outperforming traditional Bradley-Terry models.
Details
Motivation: Trajectory-wise preference labels in offline PbRL are insufficient for precise step-wise reward learning, which affects downstream algorithm performance. Current methods struggle with accurate step-wise reward inference from coarse trajectory-level preferences.
Method: DPR treats step-wise reward acquisition as binary classification using diffusion classifiers. C-DPR extends this by conditioning on trajectory-wise preference labels to enhance reward inference. Both methods leverage the robustness of diffusion models for discriminative reward learning.
Result: Experimental results show that diffusion classifier-driven reward acquisition outperforms previous methods using the Bradley-Terry model when applied to existing offline RL algorithms.
Conclusion: Diffusion-based classifiers provide an effective approach for precise step-wise reward inference in offline PbRL, addressing the limitations of trajectory-wise preference labels and improving overall algorithm performance.
Abstract: Offline preference-based reinforcement learning (PbRL) mitigates the need for reward definition, aligning with human preferences via preference-driven reward feedback without interacting with the environment. However, trajectory-wise preference labels are too coarse to support the precise learning of step-wise rewards, thereby affecting the performance of downstream algorithms. To alleviate the insufficient step-wise reward caused by trajectory-wise preferences, we propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR). DPR directly treats step-wise preference-based reward acquisition as binary classification and utilizes the robustness of diffusion classifiers to infer step-wise rewards discriminatively. In addition, to further utilize trajectory-wise preference information, we propose Conditional Diffusion Preference-based Reward (C-DPR), which conditions on trajectory-wise preference labels to enhance reward inference. We apply the above methods to existing offline RL algorithms, and a series of experimental results demonstrate that the diffusion classifier-driven reward outperforms the previous reward acquisition method based on the Bradley-Terry model.
[463] CANDLE: A Cross-Modal Agentic Knowledge Distillation Framework for Interpretable Sarcopenia Diagnosis
Yuqi Jin, Zhenhao Shuai, Zihan Hu, Weiteng Zhang, Weihao Xie, Jianwei Shuai, Xian Shen, Zhen Feng
Main category: cs.LG
TL;DR: CANDLE framework integrates SHAP explanations from traditional ML models with LLM reasoning via reinforcement learning to create interpretable diagnostic systems for sarcopenia.
Details
Motivation: Address the interpretability-performance trade-off in medical AI by combining the transparency of traditional ML with the semantic reasoning capabilities of LLMs.
Method: Extract SHAP explanations from XGBoost model, transform them into structured representations, use actor-critic RL to guide LLM reasoning, and deploy via retrieval-augmented generation (RAG).
Result: Omitted in the abstract; the framework is claimed to mitigate the interpretability-performance trade-off, enhance predictive accuracy, and preserve decision consistency.
Conclusion: CANDLE offers scalable approach to knowledge assetization of traditional ML models, enabling interpretable and clinically aligned decision support in medical domains.
Abstract: Background and Aims: Large language models (LLMs) have shown remarkable generalization and transfer capabilities by learning from vast corpora of text and web data. Their semantic representations allow cross-task knowledge transfer and reasoning, offering promising opportunities for data-scarce and heterogeneous domains such as clinical medicine. Yet, in diagnostic tasks like sarcopenia, major challenges remain: interpretability, transparency, and deployment efficiency. Traditional machine learning (TML) models provide stable performance and feature-level attribution, ensuring traceable and auditable decision logic, but lack semantic breadth. Conversely, LLMs enable flexible inference but often function as opaque predictors. Existing integration strategies remain shallow, rarely embedding the structured reasoning of TML into LLM inference. Methods: Using sarcopenia diagnosis as a case study, SHapley Additive exPlanations (SHAP) were extracted from a baseline XGBoost model and transformed into structured, LLM-compatible representations. An actor-critic reinforcement learning (RL) strategy guided the LLM to reason over these SHAP-based inputs, producing calibrated rationales and refined decision rules. The distilled reasoning was consolidated into a structured knowledge repository and deployed via retrieval-augmented generation (RAG) for case-based inference. Results: (Omitted here.) Conclusion: By coupling SHAP-derived statistical evidence with reinforcement-trained LLM reasoning, CANDLE mitigates the interpretability-performance trade-off, enhances predictive accuracy, and preserves high decision consistency. The framework offers a scalable approach to knowledge assetization of TML models, enabling interpretable, reproducible, and clinically aligned decision support in sarcopenia and potentially broader medical domains.
[464] MAME: Multidimensional Adaptive Metamer Exploration with Human Perceptual Feedback
Mina Kamao, Hayato Ono, Ayumu Yamashita, Kaoru Amano, Masataka Sawayama
Main category: cs.LG
TL;DR: The paper introduces MAME, a framework for directly exploring human metameric spaces through adaptive image generation guided by human perceptual feedback, revealing better alignment between humans and CNN models at high-level vs low-level visual processing.
Details
Motivation: Current methods for studying brain-model alignment rely on indirect approaches where model metamers are tested on humans. There's a need for direct exploration of human metameric spaces to better understand visual system organization.
Method: MAME (Multidimensional Adaptive Metamer Exploration) framework that modulates reference images across multiple dimensions based on hierarchical neural network responses, adaptively updating generation parameters according to participants’ perceptual discriminability in online experiments.
Result: Human discrimination sensitivity was lower for metameric images based on low-level features compared to high-level features, suggesting worse alignment between human and CNN metameric spaces for low-level processing. This finding contradicts expectations from recent discussions on higher-level alignment.
Conclusion: The results highlight the importance of early visual computations in developing biologically plausible models. MAME serves as a valuable tool for directly investigating human visual system organization and brain-model alignment.
Abstract: Alignment between human brain networks and artificial models has become an active research area in vision science and machine learning. A widely adopted approach is identifying “metamers,” stimuli physically different yet perceptually equivalent within a system. However, conventional methods lack a direct approach to searching for the human metameric space. Instead, researchers first develop biologically inspired models and then infer about human metamers indirectly by testing whether model metamers also appear as metamers to humans. Here, we propose the Multidimensional Adaptive Metamer Exploration (MAME) framework, enabling direct, high-dimensional exploration of human metameric spaces through online image generation guided by human perceptual feedback. MAME modulates reference images across multiple dimensions based on hierarchical neural network responses, adaptively updating generation parameters according to participants’ perceptual discriminability. Using MAME, we successfully measured multidimensional human metameric spaces within a single psychophysical experiment. Experimental results using a biologically plausible CNN model showed that human discrimination sensitivity was lower for metameric images based on low-level features compared to high-level features, which image contrast metrics could not explain. The finding suggests a relatively worse alignment between the metameric spaces of humans and the CNN model for low-level processing compared to high-level processing. Counterintuitively, given recent discussions on alignment at higher representational levels, our results highlight the importance of early visual computations in shaping biologically plausible models. Our MAME framework can serve as a future scientific tool for directly investigating the functional organization of human vision.
[465] Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
Hu Wang, Congbo Ma, Ian Reid, Mohammad Yaqub
Main category: cs.LG
TL;DR: KRPO improves GRPO by using Kalman filtering for dynamic baseline estimation instead of naive group mean, reducing variance in advantage estimation for language model policy optimization.
Details
Motivation: GRPO's use of mean reward as baseline can lead to high variance when reward advantage is inaccurately predicted, especially with dynamic reward signals in language modeling.
Method: Proposes Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) that uses lightweight Kalman filtering to dynamically estimate latent reward baseline and uncertainty, replacing the naive group mean in GRPO without adding learned parameters.
Result: KRPO improves stability and performance of GRPO, as shown through accuracies and rewards from math question answering and reasoning tasks.
Conclusion: KRPO provides a simple yet effective way to incorporate multiple outputs into advantage estimation, enhancing policy optimization for language models dealing with dynamic reward signals.
Abstract: The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. Recently, for language modeling, Group Relative Policy Optimization (GRPO) was proposed to compute the advantage for each output by subtracting the mean reward of all outputs in the group as the baseline. However, this can lead to high variance when the reward advantage is inaccurately predicted. In this work, we propose the Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) model, which uses lightweight Kalman filtering to dynamically estimate the latent reward baseline and uncertainty. This filtering technique replaces the naive group mean, enabling more adaptive advantage normalization. Our method does not require additional learned parameters over GRPO. This approach offers a simple yet effective way to incorporate multiple outputs of GRPO into advantage estimation, improving policy optimization in settings where highly dynamic reward signals are difficult to model for language models. Through the accuracies and rewards obtained on math question answering and reasoning tasks, we show that this more adaptive advantage estimation allows KRPO to improve the stability and performance of GRPO. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.
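As a rough illustration of the filtering idea, the sketch below tracks a scalar latent baseline from a stream of group rewards with a textbook Kalman filter; the function name, noise variances, and the way the filtered baseline replaces the group mean are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def kalman_baseline(rewards, process_var=1e-2, obs_var=1.0):
    """Track a latent reward baseline with a scalar Kalman filter.
    Returns the filtered baseline after each observed reward."""
    b, P = float(rewards[0]), 1.0   # initial state estimate and its variance
    baselines = []
    for r in rewards:
        P += process_var            # predict: uncertainty grows between steps
        K = P / (P + obs_var)       # Kalman gain
        b += K * (r - b)            # update the baseline toward the reward
        P *= (1.0 - K)              # posterior variance shrinks
        baselines.append(b)
    return np.array(baselines)

# Advantage: per-output reward minus the filtered baseline rather than
# the naive group mean used by GRPO.
group_rewards = np.array([0.2, 0.9, 0.4, 0.7])
advantages = group_rewards - kalman_baseline(group_rewards)[-1]
```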
[466] Kron-LoRA: Hybrid Kronecker-LoRA Adapters for Scalable, Sustainable Fine-tuning
Yixin Shen
Main category: cs.LG
TL;DR: Kron-LoRA is a hybrid adapter combining Kronecker-structured factorization with LoRA compression, achieving 4× parameter reduction while maintaining performance comparable to standard LoRA.
Details
Motivation: To create parameter-efficient and expressive adapters for fine-tuning massive pre-trained language models across multiple tasks, addressing the need for scalable and sustainable multi-task adaptation.
Method: Integrates Kronecker-structured factorization with low-rank LoRA compression, a novel combination in parameter-efficient fine-tuning that reduces parameters while preserving expressivity.
Result: Achieves up to 4× fewer parameters than standard LoRA while matching or exceeding baseline performance on DistilBERT, Mistral-7B, LLaMA-2-7B, and LLaMA-3-8B across eight benchmarks, with modest memory savings and only 5-8% speed overhead.
Conclusion: Kron-LoRA provides a scalable, sustainable solution for multi-task adaptation of large language models, delivering competitive cross-task transfer with only one-quarter of adapter parameters.
Abstract: Fine-tuning massive pre-trained language models across many tasks demands adapters that are both parameter-efficient and expressive. We introduce \textbf{Kron-LoRA}, a hybrid adapter that combines Kronecker-structured factorization with low-rank LoRA compression, an integration that, to our knowledge, has not been explored in parameter-efficient fine-tuning or in the matrix approximation literature. Kron-LoRA achieves up to 4$\times$ fewer parameters than standard LoRA while retaining similar expressivity. Experiments on DistilBERT, Mistral-7B, LLaMA-2-7B, and LLaMA-3-8B across eight benchmarks show that Kron-LoRA matches or exceeds LoRA baselines with modest memory savings and only a 5-8% speed overhead. In sequential fine-tuning, it also delivers competitive cross-task transfer despite using only one-quarter of the adapter parameters. Kron-LoRA thus offers a scalable, sustainable solution for multi-task adaptation of large language models.
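A minimal PyTorch sketch of the adapter shape, assuming the update is a Kronecker product of a small dense factor with a low-rank factor; the paper's exact factorization, scaling, and initialization may differ.

```python
import torch
import torch.nn as nn

class KronLoRALinear(nn.Module):
    """Frozen linear layer plus a Kronecker-structured, low-rank update.
    Requires out_features == m1*m2 and in_features == n1*n2."""
    def __init__(self, base: nn.Linear, m1, n1, m2, n2, rank=4, alpha=1.0):
        super().__init__()
        assert base.out_features == m1 * m2 and base.in_features == n1 * n2
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(0.01 * torch.randn(m1, n1))    # small dense factor
        self.U = nn.Parameter(0.01 * torch.randn(m2, rank))  # low-rank pieces of
        self.V = nn.Parameter(torch.zeros(rank, n2))         # the second factor
        self.alpha = alpha

    def forward(self, x):
        # The update has shape (m1*m2) x (n1*n2) but only
        # m1*n1 + rank*(m2+n2) trainable parameters.
        delta = torch.kron(self.A, self.U @ self.V)
        return self.base(x) + self.alpha * (x @ delta.T)

layer = KronLoRALinear(nn.Linear(64, 64), m1=8, n1=8, m2=8, n2=8, rank=4)
out = layer(torch.randn(2, 64))
```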
[467] Surrogate Modeling of 3D Rayleigh-Bénard Convection with Equivariant Autoencoders
Fynn Fromme, Hans Harder, Christine Allen-Blanchette, Sebastian Peitz
Main category: cs.LG
TL;DR: The paper presents an end-to-end equivariant surrogate model using G-steerable kernels for modeling 3D Rayleigh-Bénard convection, demonstrating improved sample and parameter efficiency.
Details
Motivation: Machine learning for large-scale physics systems governed by PDEs faces challenges with high degrees of freedom and complex multi-scale dynamics, requiring improved accuracy and sample efficiency.
Method: An equivariant surrogate model combining an equivariant convolutional autoencoder and equivariant convolutional LSTM with G-steerable kernels, specifically using vertically stacked D4-steerable kernels with partial kernel sharing in the vertical direction.
Result: The model achieves significant gains in sample and parameter efficiency, and better scaling to complex dynamics compared to traditional approaches.
Conclusion: The proposed equivariant architecture effectively handles the E(2)-equivariance in horizontal planes while accommodating broken translational equivariance in the vertical direction, providing an efficient solution for physics system modeling.
Abstract: The use of machine learning for modeling, understanding, and controlling large-scale physics systems is quickly gaining in popularity, with examples ranging from electromagnetism over nuclear fusion reactors and magneto-hydrodynamics to fluid mechanics and climate modeling. These systems - governed by partial differential equations - present unique challenges regarding the large number of degrees of freedom and the complex dynamics over many scales both in space and time, and additional measures to improve accuracy and sample efficiency are highly desirable. We present an end-to-end equivariant surrogate model consisting of an equivariant convolutional autoencoder and an equivariant convolutional LSTM using $G$-steerable kernels. As a case study, we consider the three-dimensional Rayleigh-Bénard convection, which describes the buoyancy-driven fluid flow between a heated bottom and a cooled top plate. While the system is E(2)-equivariant in the horizontal plane, the boundary conditions break the translational equivariance in the vertical direction. Our architecture leverages vertically stacked layers of $D_4$-steerable kernels, with additional partial kernel sharing in the vertical direction for further efficiency improvement. We demonstrate significant gains in sample and parameter efficiency, as well as a better scaling to more complex dynamics. The accompanying code is available under https://github.com/FynnFromme/equivariant-rb-forecasting.
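As loose intuition for the symmetry being exploited (not the paper's method: steerable kernels constrain the weights analytically rather than by averaging), one can make a 2D kernel invariant under the dihedral group $D_4$ by averaging it over the group's eight elements:

```python
import numpy as np

def d4_symmetrize(kernel):
    """Average a 2D kernel over D4: four rotations times a horizontal flip.
    Yields a D4-invariant kernel; true D4-steerable kernels are richer."""
    rots = [np.rot90(kernel, k) for k in range(4)]
    orbit = rots + [np.fliplr(k) for k in rots]
    return np.mean(orbit, axis=0)

k = np.random.randn(5, 5)
k_sym = d4_symmetrize(k)
assert np.allclose(k_sym, np.rot90(k_sym))  # invariant under 90-degree rotation
```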
[468] AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum
Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, Dejing Dou
Main category: cs.LG
TL;DR: AAPO is a novel RL algorithm that improves training efficiency in group relative advantage estimation for LLMs by using momentum-based advantage enhancement.
Details
Motivation: Existing group relative advantage estimation methods like GRPO suffer from training inefficiencies when estimated advantages approach zero, limiting their effectiveness in enhancing LLM reasoning capabilities.
Method: Advantage-Augmented Policy Optimization (AAPO) optimizes cross-entropy loss using advantages enhanced through a momentum-based estimation scheme to mitigate inefficiencies in group relative advantage estimation.
Result: Experimental results on multiple mathematical reasoning benchmarks demonstrate AAPO’s superior performance compared to existing methods.
Conclusion: AAPO effectively addresses training inefficiencies in group relative advantage estimation, providing a more efficient RL approach for enhancing LLM reasoning capabilities without dependency on value models.
Abstract: Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO.
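The sketch below shows one way a momentum term can keep the advantage signal alive when the group-relative advantage collapses toward zero; the update rule and coefficient are our assumptions about the "advantage momentum" idea, not the paper's exact scheme.

```python
import numpy as np

def aapo_advantages(rewards, momentum, beta=0.9):
    """Group-relative advantages augmented with a running momentum estimate.
    When rewards within a group are nearly equal, `adv` is ~0 and the
    momentum term supplies the remaining training signal."""
    adv = rewards - rewards.mean()
    momentum = beta * momentum + (1.0 - beta) * adv
    return adv + momentum, momentum

momentum = np.zeros(4)
for rewards in [np.array([0.5, 0.5, 0.6, 0.4]),
                np.array([0.5, 0.5, 0.5, 0.5])]:   # second group: adv == 0
    augmented, momentum = aapo_advantages(rewards, momentum)
```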
[469] EC-LDA: Label Distribution Inference Attack against Federated Graph Learning with Embedding Compression
Tong Cheng, Jie Fu, Xinpeng Ling, Huifa Li, Zhili Chen, Haifeng Qian, Junqing Gong
Main category: cs.LG
TL;DR: This paper proposes EC-LDA, a novel label distribution attack on Federated Graph Learning that compresses node embeddings to significantly improve attack effectiveness for inferring client data label distributions.
Details
Motivation: Federated Graph Learning allows collaborative training while keeping client data localized, but malicious servers can still steal private information through uploaded gradients. The authors aim to develop more effective attacks to expose vulnerabilities in FGL systems.
Method: The authors observe that attack effectiveness relates to node embedding variance in GNNs, analyze this relationship, and propose EC-LDA which compresses node embeddings to enhance attack performance.
Result: Extensive experiments on node classification and link prediction across six graph datasets show EC-LDA outperforms state-of-the-art LDAs, achieving Cos-sim as high as 1.0 in almost all cases. The attack also demonstrates robustness under differential privacy protection.
Conclusion: EC-LDA represents a significant advancement in label distribution attacks on FGL, highlighting serious privacy vulnerabilities. The authors discuss potential defense methods to counter this new threat.
Abstract: Graph Neural Networks (GNNs) have been widely used for graph analysis. Federated Graph Learning (FGL) is an emerging learning framework to collaboratively train graph data from various clients. Although FGL allows client data to remain localized, a malicious server can still steal clients' private information through uploaded gradients. In this paper, we propose, for the first time, label distribution attacks (LDAs) on FGL that aim to infer the label distributions of the client-side data. Firstly, we observe that the effectiveness of LDA is closely related to the variance of node embeddings in GNNs. Next, we analyze the relation between them and propose a new attack named EC-LDA, which significantly improves the attack effectiveness by compressing node embeddings. Then, extensive experiments on node classification and link prediction tasks across six widely used graph datasets show that EC-LDA outperforms the SOTA LDAs. Specifically, EC-LDA can achieve the Cos-sim as high as 1.0 under almost all cases. Finally, we explore the robustness of EC-LDA under differential privacy protection and discuss the potential effective defense methods to EC-LDA. Our code is available at https://github.com/cheng-t/EC-LDA.
[470] Beyond the Pre-Service Horizon: Infusing In-Service Behavior for Improved Financial Risk Forecasting
Senhao Liu, Zhiyu Guo, Zhiyuan Ji, Yueguo Chen, Yateng Tang, Yunhai Wang, Xuehao Zheng, Xiang Ao
Main category: cs.LG
TL;DR: MGKD framework improves pre-service risk prediction by integrating in-service user behavior data through knowledge distillation with multi-granularity strategies.
Details
Motivation: Traditional financial risk management separates pre-service risk assessment and in-service default detection, missing opportunities to leverage in-service data for better pre-service predictions.
Method: Multi-Granularity Knowledge Distillation (MGKD) uses teacher-student model approach where teacher trained on in-service data guides student trained on pre-service data through soft labels. Includes coarse-grained, fine-grained, and self-distillation strategies with re-weighting for class imbalance.
Result: Experimental results on Tencent Mobile Payment datasets show effectiveness in both offline and online scenarios.
Conclusion: MGKD successfully transfers key behavioral patterns from in-service to pre-service risk assessment, improving overall performance while addressing class imbalance.
Abstract: Typical financial risk management involves distinct phases for pre-service risk assessment and in-service default detection, often modeled separately. This paper proposes a novel framework, Multi-Granularity Knowledge Distillation (abbreviated as MGKD), aimed at improving pre-service risk prediction through the integration of in-service user behavior data. MGKD follows the idea of knowledge distillation, where the teacher model, trained on historical in-service data, guides the student model, which is trained on pre-service data. By using soft labels derived from in-service data, the teacher model helps the student model improve its risk prediction prior to service activation. Meanwhile, a multi-granularity distillation strategy is introduced, including coarse-grained, fine-grained, and self-distillation, to align the representations and predictions of the teacher and student models. This approach not only reinforces the representation of default cases but also enables the transfer of key behavioral patterns associated with defaulters from the teacher to the student model, thereby improving the overall performance of pre-service risk assessment. Moreover, we adopt a re-weighting strategy to mitigate the model’s bias towards the minority class. Experimental results on large-scale real-world datasets from Tencent Mobile Payment demonstrate the effectiveness of our proposed approach in both offline and online scenarios.
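A minimal sketch of the coarse-grained distillation component, assuming a standard soft-label KD loss with temperature plus a class-weighted hard-label term for the re-weighting strategy; names and weighting choices are ours, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def mgkd_coarse_loss(student_logits, teacher_logits, labels,
                     T=2.0, soft_weight=0.5, class_weights=None):
    """Cross-entropy on hard labels plus KL to the teacher's soft labels.
    `class_weights` up-weights the minority (default) class."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels, weight=class_weights)
    return soft_weight * soft + (1.0 - soft_weight) * hard

s = torch.randn(8, 2)                  # student (pre-service) logits
t = torch.randn(8, 2)                  # teacher (in-service) logits
y = torch.randint(0, 2, (8,))
loss = mgkd_coarse_loss(s, t, y, class_weights=torch.tensor([1.0, 5.0]))
```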
[471] Procedural Environment Generation for Tool-Use Agents
Michael Sullivan, Mareike Hartmann, Alexander Koller
Main category: cs.LG
TL;DR: RandomWorld is a pipeline for generating interactive tools and compositional tool-use data to address the problem of limited training data for LLM tool-use agents, particularly for online RL training.
Details
Motivation: Existing approaches to synthetic tool-use data generation are non-interactive and/or non-compositional, creating a gap in effective training data for LLM tool-use agents.
Method: The authors introduce RandomWorld, a procedural generation pipeline that creates interactive tools and compositional tool-use data for training models via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
Result: Models trained on RandomWorld data improve performance on various tool-use benchmarks and achieve new state-of-the-art results on two metrics of the NESTFUL dataset. Performance scales with the amount of synthetic training data.
Conclusion: RandomWorld enables effective training of LLM tool-use agents through synthetic data generation, with potential for further improvements as more synthetic data is used.
Abstract: Although the power of LLM tool-use agents has ignited a flurry of recent research in this area, the curation of tool-use training data remains an open problem, especially for online RL training. Existing approaches to synthetic tool-use data generation tend to be non-interactive and/or non-compositional. We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data. We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks, and set the new SoTA for two metrics on the NESTFUL dataset. Further experiments show that downstream performance scales with the amount of RandomWorld-generated training data, opening up the possibility of further improvement through the use of entirely synthetic data.
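To make "compositional" concrete, here is a toy of the kind of tool chaining a procedural generator can emit, where one randomly generated tool's output feeds another's input; this is entirely our illustration, and RandomWorld's generator is far richer.

```python
import random

def make_tool(name, fn):
    return {"name": name, "fn": fn}

def compose(inner, outer):
    """Chain two tools: the output of `inner` becomes the input of `outer`."""
    return make_tool(f"{outer['name']}_of_{inner['name']}",
                     lambda x: outer["fn"](inner["fn"](x)))

base_tools = [make_tool("double", lambda x: 2 * x),
              make_tool("increment", lambda x: x + 1)]
tool = compose(*random.sample(base_tools, 2))
print(tool["name"], tool["fn"](3))  # e.g. increment_of_double 7
```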
[472] DRES: Fake news detection by dynamic representation and ensemble selection
Faramarz Farhangian, Leandro A. Ensina, George D. C. Cavalcanti, Rafael M. O. Cruz
Main category: cs.LG
TL;DR: DRES is a novel fake news detection method that uses instance hardness measures to dynamically select optimal textual representations and classifier ensembles for each news article, achieving superior performance over state-of-the-art methods.
Details
Motivation: The rapid spread of fake news via social media has significant societal impact, making text-based detection critically important to combat misinformation.
Method: DRES leverages instance hardness measures to estimate classification difficulty for each news article across multiple textual feature representations, then dynamically selects the best textual representation and most competent ensemble of classifiers for each instance.
Result: Extensive experiments show DRES achieves notable improvements over state-of-the-art methods in fake news detection accuracy.
Conclusion: The effectiveness of representation selection based on instance hardness and dynamic ensemble selection significantly boosts fake news detection performance.
Abstract: The rapid spread of information via social media has made text-based fake news detection critically important due to its societal impact. This paper presents a novel detection method called Dynamic Representation and Ensemble Selection (DRES) for identifying fake news based solely on text. DRES leverages instance hardness measures to estimate the classification difficulty for each news article across multiple textual feature representations. By dynamically selecting the textual representation and the most competent ensemble of classifiers for each instance, DRES significantly enhances prediction accuracy. Extensive experiments show that DRES achieves notable improvements over state-of-the-art methods, confirming the effectiveness of representation selection based on instance hardness and dynamic ensemble selection in boosting performance. Codes and data are available at: https://github.com/FFarhangian/FakeNewsDetection_DRES
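One widely used instance-hardness measure that such a selector could rely on is k-Disagreeing Neighbors (kDN); the sketch below computes it per instance for a single representation. DRES combines several measures and representations, so this is only one ingredient, and the data here is a placeholder.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kdn_hardness(X, y, k=5):
    """k-Disagreeing Neighbors: fraction of an instance's k nearest
    neighbors whose label differs. Higher = harder to classify."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)          # idx[:, 0] is the point itself
    return np.array([(y[i[1:]] != y[i[0]]).mean() for i in idx])

X = np.random.randn(100, 20)             # one textual feature representation
y = np.random.randint(0, 2, 100)         # fake / real labels
hardness = kdn_hardness(X, y)            # choose the representation whose
                                         # neighborhood looks easiest per instance
```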
[473] Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis
Mujie Liu, Chenze Wang, Liping Chen, Nguyen Linh Dan Le, Niharika Tewari, Ting Dang, Jiangang Ma, Feng Xia
Main category: cs.LG
TL;DR: SAM-BG is a two-stage self-supervised learning framework that preserves structural semantics in brain graphs for psychiatric diagnosis, using edge masking and structure-aware augmentation to achieve better performance in limited labeled data settings.
Details
Motivation: Limited labeled brain network data makes accurate psychiatric diagnosis challenging, and existing SSL methods often disrupt crucial structural semantics in brain graphs through inappropriate augmentation strategies.
Method: A two-stage framework: (1) pre-training stage trains an edge masker on small labeled subset to capture structural semantics, (2) SSL stage uses structural priors to guide structure-aware augmentation for learning semantically meaningful representations.
Result: SAM-BG outperforms state-of-the-art methods on two real-world psychiatric datasets, especially in small-labeled data settings, and uncovers clinically relevant connectivity patterns for enhanced interpretability.
Conclusion: The proposed SAM-BG framework effectively preserves structural semantics in brain graphs, enabling more accurate and interpretable psychiatric diagnoses with limited labeled data.
Abstract: The limited availability of labeled brain network data makes it challenging to achieve accurate and interpretable psychiatric diagnoses. While self-supervised learning (SSL) offers a promising solution, existing methods often rely on augmentation strategies that can disrupt crucial structural semantics in brain graphs. To address this, we propose SAM-BG, a two-stage framework for learning brain graph representations with structural semantic preservation. In the pre-training stage, an edge masker is trained on a small labeled subset to capture key structural semantics. In the SSL stage, the extracted structural priors guide a structure-aware augmentation process, enabling the model to learn more semantically meaningful and robust representations. Experiments on two real-world psychiatric datasets demonstrate that SAM-BG outperforms state-of-the-art methods, particularly in small-labeled data settings, and uncovers clinically relevant connectivity patterns that enhance interpretability. Our code is available at https://github.com/mjliu99/SAM-BG.
[474] FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning
Ganyu Wang, Jinjie Fang, Maxwell J. Yin, Bin Gu, Xi Chen, Boyu Wang, Yi Chang, Charles Ling
Main category: cs.LG
TL;DR: FedOne is a federated black-box discrete prompt learning framework that optimizes query efficiency by activating only one client per round, significantly reducing costs associated with cloud-based LLM services.
Details
Motivation: Previous federated black-box prompt tuning research neglected the substantial query costs of cloud-based LLM services, creating a need for more query-efficient methods.
Method: Theoretical analysis revealed that degrading FedAvg to activate only one client per round (FedOne) achieves optimal query efficiency. The FedOne framework implements this strategy for federated black-box discrete prompt learning.
Result: Numerical experiments demonstrated significant improvement in query efficiency, aligning with theoretical predictions.
Conclusion: FedOne provides an effective solution for query-efficient federated black-box prompt learning, making cloud-based LLM prompt tuning more practical and cost-effective.
Abstract: Black-Box Discrete Prompt Learning (BDPL) is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making prompt tuning on a cloud-based Large Language Model (LLM) feasible. Adapting federated learning to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. However, all previous research on federated black-box prompt tuning has neglected the substantial query cost associated with cloud-based LLM services. To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning. Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we call \textit{FedOne}, enables optimal query efficiency in federated black-box prompt learning. Building on this insight, we propose the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results.
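A schematic of the FedOne round structure, assuming a generic gradient-free local update; the stub below is hypothetical, and the point is only that a single sampled client is active, so the per-round LLM query cost is that of one client.

```python
import random

def local_update(prompt_params, client_data, step=0.1):
    """Hypothetical stand-in for a client's black-box discrete prompt
    update driven by cloud-LLM query feedback."""
    return [p + step * random.uniform(-1, 1) for p in prompt_params]

def fedone_round(server_params, clients):
    """FedAvg degraded to one active client per round: sample a client,
    run its local update, and adopt the result as the new global state."""
    client = random.choice(clients)
    return local_update(server_params, client)

server_params = [0.0] * 8
clients = [f"client_{i}" for i in range(10)]   # placeholder client handles
for _ in range(100):
    server_params = fedone_round(server_params, clients)
```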
[475] Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
Jiaqi Weng, Han Zheng, Hanyu Zhang, Qinqin He, Jialing Tao, Hui Xue, Zhixuan Chu, Xiting Wang
Main category: cs.LG
TL;DR: Safe-SAIL is a framework that uses Sparse Autoencoders (SAEs) to interpret safety-related features in large language models, addressing the limitations of current safety research by systematically identifying and explaining safety-critical neurons.
Details
Motivation: Existing safety research focuses on evaluating outputs or specific tasks but fails to address broader undefined risks. SAEs can decompose model behavior into interpretable features, but prior applications haven't adequately addressed safety-critical behaviors like toxic response generation and safety regulation violations.
Method: The proposed Safe-SAIL framework systematically identifies SAEs with the best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process. It includes a toolkit with SAE checkpoints and human-readable neuron explanations.
Result: The framework enables extraction of a rich and diverse set of safety-relevant features that effectively capture high-risk behaviors in LLMs, overcoming challenges of identifying optimal SAEs and reducing the cost of detailed feature explanation.
Conclusion: Safe-SAIL advances mechanistic understanding of safety domains in LLMs and promotes research on LLM safety through empirical analysis of safety risks, with the release of a comprehensive toolkit to support this research.
Abstract: Increasing deployment of large language models (LLMs) in real-world applications raises significant safety concerns. Most existing safety research focuses on evaluating LLM outputs or specific safety tasks, limiting their ability to address broader, undefined risks. Sparse Autoencoders (SAEs) facilitate interpretability research to clarify model behavior by explaining single-meaning atomic features decomposed from entangled signals. However, prior applications of SAEs do not interpret features with fine-grained safety-related concepts, thus inadequately addressing safety-critical behaviors, such as generating toxic responses and violating safety regulations. For rigorous safety analysis, we must extract a rich and diverse set of safety-relevant features that effectively capture these high-risk behaviors, yet face two challenges: identifying SAEs with the greatest potential for generating safety concept-specific neurons, and the prohibitively high cost of detailed feature explanation. In this paper, we propose Safe-SAIL, a framework for interpreting SAE features within LLMs to advance mechanistic understanding in safety domains. Our approach systematically identifies the SAE with the best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process. We will release a comprehensive toolkit including SAE checkpoints and human-readable neuron explanations, which supports empirical analysis of safety risks to promote research on LLM safety.
[476] UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang
Main category: cs.LG
TL;DR: Semi-online Reinforcement Learning is a novel paradigm that simulates online RL on offline trajectories to address the limitations of both offline and online RL approaches for GUI agents, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: Current GUI agent approaches face a dilemma: offline RL enables stable training but struggles with multi-step task execution due to lack of trajectory-level rewards, while online RL captures these signals but suffers from sparse rewards and high deployment costs.
Method: The method simulates online RL on offline trajectories by preserving original model outputs in multi-turn dialogues, using a Patch Module to recover divergence between rollout and expert trajectories. It introduces discounted future returns into reward computation and optimizes policy with weighted step-level and episode-level advantages.
Result: Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (+12.0% on AndroidWorld, +23.8% on AITW), demonstrating progress in bridging offline training efficiency and online multi-turn reasoning.
Conclusion: The proposed Semi-online RL paradigm effectively addresses the fundamental dilemma in GUI agent training, providing a practical solution that combines the benefits of both offline and online approaches while introducing a new evaluation metric (SOP) that better aligns with true online performance.
Abstract: Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution for lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.
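The discounted-future-return ingredient is standard; a short sketch of how step rewards along an offline trajectory would be converted into long-term training signals (the discount value is our placeholder):

```python
import numpy as np

def discounted_returns(step_rewards, gamma=0.99):
    """Discounted future return at each step of a trajectory:
    G_t = r_t + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(step_rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

returns = discounted_returns([0.0, 0.0, 1.0, 0.0, 1.0])
```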
[477] DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda
Main category: cs.LG
TL;DR: DuoGPT is a unified framework that combines unstructured weight pruning with activation sparsity to create dual-sparse workloads, achieving better performance than state-of-the-art structured pruning methods while maintaining efficiency.
Details
Motivation: Large language models (LLMs) have high memory and compute costs, making deployment difficult. Most pruning methods ignore activation sparsity observed at runtime, which could be leveraged to reduce demands further.
Method: Reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT framework. Extend Optimal Brain Compression (OBC) with activation-aware calibration and introduce output residuals from dense model as correction terms. Optimize for efficient GPU execution to scale to billion-parameter LLMs.
Result: Evaluations on LLaMA-2 and LLaMA-3 show DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39× compared to baseline dense model.
Conclusion: DuoGPT provides an effective approach to reduce LLM deployment costs by leveraging both weight pruning and activation sparsity, achieving superior accuracy and efficiency compared to existing methods.
Abstract: Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39$\times$ compared to the baseline dense model. Code is available at Github.
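The dual-sparse (spMspV) workload can be pictured as below: a static unstructured pruning mask on the weights combined with a runtime mask from inactive inputs. The thresholding rule and shapes are illustrative, and the paper targets optimized GPU kernels rather than this dense emulation.

```python
import torch

def dual_sparse_matvec(W, weight_mask, x, act_threshold=0.0):
    """Emulate sparse-matrix x sparse-vector: pruned weights are zeroed
    by `weight_mask`; columns for inactive activations are skipped."""
    active = x.abs() > act_threshold        # runtime activation sparsity
    W_eff = (W * weight_mask)[:, active]    # static unstructured weight sparsity
    return W_eff @ x[active]

W = torch.randn(4, 8)
weight_mask = (torch.rand(4, 8) > 0.5).float()   # unstructured pruning mask
x = torch.randn(8) * (torch.rand(8) > 0.5)       # sparse activations
y = dual_sparse_matvec(W, weight_mask, x)
```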
[478] Exploring Graph-Transformer Out-of-Distribution Generalization Abilities
Itay Niv, Neta Rabin
Main category: cs.LG
TL;DR: This paper investigates out-of-distribution (OOD) generalization in graph neural networks, comparing graph-transformer (GT) and hybrid backbones against traditional message-passing neural networks (MPNNs) under distribution shifts.
Details
Motivation: Current graph learning methods assume training and testing data share the same distribution, which is rarely true in real-world scenarios. While GTs outperform MPNNs in in-distribution settings, their effectiveness under distribution shifts remains unexplored.
Method: Systematically evaluate GT and hybrid GT-MPNN backbones in OOD settings, adapt domain generalization algorithms for GTs, and propose a novel post-training analysis approach examining clustering structure of ID and OOD test datasets.
Result: GT and hybrid GT-MPNN backbones demonstrate stronger generalization ability than MPNNs (on 4 out of 6 benchmarks), even without specialized DG algorithms. The proposed analysis method provides valuable insights into generalization abilities beyond standard accuracy metrics.
Conclusion: Graph-transformers show promise for robust, real-world graph learning and set a new direction for future OOD generalization research, with the proposed analysis method being model-agnostic and applicable beyond graph learning.
Abstract: Deep learning on graphs has shown remarkable success across numerous applications, including social networks, bio-physics, traffic networks, and recommendation systems. Despite these successes, current methods frequently depend on the assumption that training and testing data share the same distribution, a condition rarely met in real-world scenarios. While graph-transformer (GT) backbones have recently outperformed traditional message-passing neural networks (MPNNs) in multiple in-distribution (ID) benchmarks, their effectiveness under distribution shifts remains largely unexplored. In this work, we address the challenge of out-of-distribution (OOD) generalization for graph neural networks, with a special focus on the impact of backbone architecture. We systematically evaluate GT and hybrid backbones in OOD settings and compare them to MPNNs. To do so, we adapt several leading domain generalization (DG) algorithms to work with GTs and assess their performance on a benchmark designed to test a variety of distribution shifts. Our results reveal that GT and hybrid GT-MPNN backbones demonstrate stronger generalization ability compared to MPNNs, even without specialized DG algorithms (on four out of six benchmarks). Additionally, we propose a novel post-training analysis approach that compares the clustering structure of the entire ID and OOD test datasets, specifically examining domain alignment and class separation. Highlighting its model-agnostic design, the method yielded valuable insights into both GT and MPNN backbones and appears well suited for broader DG applications beyond graph learning, offering a deeper perspective on generalization abilities that goes beyond standard accuracy metrics. Together, our findings highlight the promise of graph-transformers for robust, real-world graph learning and set a new direction for future research in OOD generalization.
[479] Multimodal Atmospheric Super-Resolution With Deep Generative Models
Dibyajyoti Chakraborty, Haiwen Guan, Jason Stock, Troy Arcomano, Guido Cervone, Romit Maulik
Main category: cs.LG
TL;DR: Score-based diffusion models are applied for super-resolution of high-dimensional dynamical systems using sparse sensor measurements, enabling data fusion and uncertainty estimation.
Details
Motivation: To create a novel paradigm for data and model fusion where pretrained score-based diffusion models can be updated with real-time online data in a Bayesian framework for super-resolution tasks.
Method: Use score-based diffusion modeling to learn the gradient of log-probability density, reverse noising processes, and enable zero-shot conditioning on observed data for spatiotemporal reconstructions.
Result: Accurate recovery of high-dimensional atmospheric states from multiple low-fidelity measurement sources (ERA5 and IGRA datasets), with the model effectively balancing influence from multiple dataset modalities.
Conclusion: Score-based diffusion models provide an effective framework for super-resolution and data fusion in high-dimensional dynamical systems, with demonstrated capability to handle multimodal data sources and provide uncertainty estimates.
Abstract: Score-based diffusion modeling is a generative machine learning algorithm that can be used to sample from complex distributions. Such models achieve this by learning a score function, i.e., the gradient of the log-probability density of the data, and reversing a noising process using the same. Once trained, score-based diffusion models not only generate new samples but also enable zero-shot conditioning of the generated samples on observed data. This promises a novel paradigm for data and model fusion, wherein the implicitly learned distributions of pretrained score-based diffusion models can be updated given the availability of online data in a Bayesian formulation. In this article, we apply such a concept to the super-resolution of a high-dimensional dynamical system, given the real-time availability of low-resolution and experimentally observed sparse sensor measurements from multimodal data. Additional analysis on how score-based sampling can be used for uncertainty estimates is also provided. Our experiments are performed for a super-resolution task that generates the ERA5 atmospheric dataset given sparse observations from a coarse-grained representation of the same and/or from unstructured experimental observations of the IGRA radiosonde dataset. We demonstrate accurate recovery of the high-dimensional state given multiple sources of low-fidelity measurements. We also discover that the generative model can balance the influence of multiple dataset modalities during spatiotemporal reconstructions.
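The zero-shot conditioning referred to here typically hinges on Bayes' rule at the level of scores. In one common form (our notation, with a linear-Gaussian measurement model as an example, which may differ from the paper's exact guidance):

\[
\nabla_{x}\log p(x \mid y) \;=\; \nabla_{x}\log p(x) \;+\; \nabla_{x}\log p(y \mid x),
\]

where the first term comes from the pretrained score network and, for sparse observations $y = Ax + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, the second term is $A^{\top}(y - Ax)/\sigma^{2}$.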
[480] Self-Evolving LLMs via Continual Instruction Tuning
Jiazheng Kang, Le Huang, Cheng Hou, Zhe Zhao, Zhenxiang Yan, Chuan Shi, Ting Bai
Main category: cs.LG
TL;DR: MoE-CL is a parameter-efficient adversarial mixture-of-experts framework for continual instruction tuning of LLMs that addresses catastrophic forgetting through dedicated task-specific experts and a shared expert with adversarial training.
Details
Motivation: Large language models in industrial settings need continual learning to adapt to evolving tasks, but existing approaches suffer from catastrophic forgetting where training on new tasks degrades performance on earlier ones.
Method: Uses dual-expert design: dedicated LoRA expert per task to preserve task-specific knowledge, and shared LoRA expert for cross-task transfer with a task-aware discriminator in a GAN framework to filter task-irrelevant noise.
Result: Extensive experiments on MTL5 and Tencent3 benchmarks show effectiveness, with real-world A/B testing on Tencent Video platform reducing manual review costs by 15.3%.
Conclusion: MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical, balancing knowledge retention and cross-task generalization.
Abstract: In real-world industrial settings, large language models (LLMs) must learn continually to keep pace with diverse and evolving tasks, requiring self-evolution to refine knowledge under dynamic data distributions. However, existing continual learning (CL) approaches, such as replay and parameter isolation, often suffer from catastrophic forgetting: training on new tasks degrades performance on earlier ones by overfitting to the new distribution and weakening generalization. We propose MoE-CL, a parameter-efficient adversarial mixture-of-experts framework for industrial-scale, self-evolving continual instruction tuning of LLMs. MoE-CL uses a dual-expert design: (1) a dedicated LoRA expert per task to preserve task-specific knowledge via parameter independence, mitigating forgetting; and (2) a shared LoRA expert to enable cross-task transfer. To prevent transferring task-irrelevant noise through the shared pathway, we integrate a task-aware discriminator within a GAN. The discriminator encourages the shared expert to pass only task-aligned information during sequential training. Through adversarial learning, the shared expert acquires generalized representations that mimic the discriminator, while dedicated experts retain task-specific details, balancing knowledge retention and cross-task generalization and thereby supporting self-evolution. Extensive experiments on the public MTL5 benchmark and an industrial Tencent3 benchmark validate the effectiveness of MoE-CL for continual instruction tuning. In real-world A/B testing for content compliance review on the Tencent Video platform, MoE-CL reduced manual review costs by 15.3%. These results demonstrate that MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical.
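A structural sketch of the dual-expert pathway, assuming additive LoRA-style experts on a frozen base layer; the adversarial task-aware discriminator is omitted, and the composition rule is our assumption.

```python
import torch
import torch.nn as nn

class DualExpertLoRA(nn.Module):
    """Frozen base layer plus one dedicated low-rank expert per task
    and one shared low-rank expert for cross-task transfer."""
    def __init__(self, base: nn.Linear, num_tasks, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)
        def expert():
            return nn.Sequential(
                nn.Linear(base.in_features, rank, bias=False),
                nn.Linear(rank, base.out_features, bias=False))
        self.task_experts = nn.ModuleList(expert() for _ in range(num_tasks))
        self.shared_expert = expert()

    def forward(self, x, task_id):
        return (self.base(x)
                + self.task_experts[task_id](x)   # task-specific knowledge
                + self.shared_expert(x))          # shared, transferable knowledge

layer = DualExpertLoRA(nn.Linear(32, 32), num_tasks=3)
out = layer(torch.randn(2, 32), task_id=1)
```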
[481] GradNetOT: Learning Optimal Transport Maps with GradNets
Shreyas Chaudhari, Srinivasa Pranav, José M. F. Moura
Main category: cs.LG
TL;DR: This paper proposes using Monotone Gradient Networks (mGradNets) to learn optimal transport maps by minimizing a loss function based on the Monge-Ampère equation, showing effectiveness in image morphing and high-dimensional OT problems.
Details
Motivation: Monotone gradient functions are crucial for solving the Monge formulation of optimal transport problems, which have applications in fluid dynamics and robot swarm control. Brenier's theorem ensures that the optimal transport map is the gradient of a convex function when using squared Euclidean distance.
Method: The authors leverage mGradNets, neural networks that parameterize monotone gradient maps, to directly learn the optimal transport mapping by minimizing a training loss function defined using the Monge-Ampère equation.
Result: Empirical results demonstrate that the structural bias of mGradNets facilitates learning optimal transport maps in both image morphing tasks and high-dimensional optimal transport problems.
Conclusion: mGradNets provide an effective approach for learning optimal transport maps by directly parameterizing monotone gradient functions and leveraging the Monge-Ampère equation, showing promise for various applications.
Abstract: Monotone gradient functions play a central role in solving the Monge formulation of the optimal transport (OT) problem, which arises in modern applications ranging from fluid dynamics to robot swarm control. When the transport cost is the squared Euclidean distance, Brenier’s theorem guarantees that the unique optimal transport map satisfies a Monge-Ampère equation and is the gradient of a convex function. In [arXiv:2301.10862] [arXiv:2404.07361], we proposed Monotone Gradient Networks (mGradNets), neural networks that directly parameterize the space of monotone gradient maps. In this work, we leverage mGradNets to directly learn the optimal transport mapping by minimizing a training loss function defined using the Monge-Ampère equation. We empirically show that the structural bias of mGradNets facilitates the learning of optimal transport maps across both image morphing tasks and high-dimensional OT problems.
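For context, Brenier's theorem says the optimal map is $T = \nabla\varphi$ for convex $\varphi$, and the Monge-Ampère equation ties it to the densities; a residual of this equation can then be minimized over a monotone gradient map $g_\theta$. The loss form below is our paraphrase of the idea, not the paper's exact objective:

\[
\det\!\big(\nabla T(x)\big)\, q\big(T(x)\big) \;=\; p(x),
\qquad
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x\sim p}\Big[\big(\det \nabla g_\theta(x)\; q\big(g_\theta(x)\big) - p(x)\big)^{2}\Big],
\]

where $p$ and $q$ are the source and target densities.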
[482] FedRAIN-Lite: Federated Reinforcement Algorithms for Improving Idealised Numerical Weather and Climate Models
Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark Webb
Main category: cs.LG
TL;DR: FedRAIN-Lite is a federated reinforcement learning framework that enables adaptive sub-grid parameter learning in climate models by assigning RL agents to latitude bands with periodic global aggregation.
Details
Motivation: Traditional sub-grid parameterisations in climate models are static and tuned offline, limiting their adaptability to evolving climate states, which creates a need for more adaptive learning approaches.
Method: Uses a hierarchy of simplified energy-balance climate models (ebm-v1 to ebm-v3) to benchmark three RL algorithms under different FedRL configurations, with Deep Deterministic Policy Gradient (DDPG) showing superior performance.
Result: DDPG consistently outperforms static and single-agent baselines with faster convergence and lower area-weighted RMSE in tropical and mid-latitude zones across both multi-agent ensemble and GCM-like setups.
Conclusion: The framework provides a scalable pathway towards high-complexity GCMs and offers a prototype for physically aligned, online-learning climate models that can evolve with a changing climate.
Abstract: Sub-grid parameterisations in climate models are traditionally static and tuned offline, limiting adaptability to evolving states. This work introduces FedRAIN-Lite, a federated reinforcement learning (FedRL) framework that mirrors the spatial decomposition used in general circulation models (GCMs) by assigning agents to latitude bands, enabling local parameter learning with periodic global aggregation. Using a hierarchy of simplified energy-balance climate models, from a single-agent baseline (ebm-v1) to multi-agent ensemble (ebm-v2) and GCM-like (ebm-v3) setups, we benchmark three RL algorithms under different FedRL configurations. Results show that Deep Deterministic Policy Gradient (DDPG) consistently outperforms both static and single-agent baselines, with faster convergence and lower area-weighted RMSE in tropical and mid-latitude zones across both ebm-v2 and ebm-v3 setups. DDPG’s ability to transfer across hyperparameters and low computational cost make it well-suited for geographically adaptive parameter learning. This capability offers a scalable pathway towards high-complexity GCMs and provides a prototype for physically aligned, online-learning climate models that can evolve with a changing climate. Code accessible at https://github.com/p3jitnath/climate-rl-fedrl.
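The federated step itself can be as simple as a periodic weighted average of the per-band agents' parameters; a FedAvg-style sketch in which the band count, weights, and parameter shapes are placeholders:

```python
import numpy as np

def federated_aggregate(agent_params, weights=None):
    """Weighted average of per-latitude-band agent parameter vectors."""
    params = np.stack(agent_params)
    if weights is None:
        weights = np.full(len(agent_params), 1.0 / len(agent_params))
    return weights @ params

bands = [np.random.randn(16) for _ in range(6)]      # six latitude-band agents
area_weights = np.array([0.30, 0.25, 0.20, 0.15, 0.07, 0.03])
global_params = federated_aggregate(bands, area_weights)
# Broadcast `global_params` back to all agents, then resume local learning.
```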
[483] FORGE: Foundational Optimization Representations from Graph Embeddings
Zohair Shafi, Serdar Kadioglu
Main category: cs.LG
TL;DR: Forge is a framework that pre-trains a vector-quantized graph autoencoder on diverse mixed-integer programming instances to create foundational optimization representations that generalize across problem domains and sizes.
Details
Motivation: Existing learning-based optimization methods require training dedicated models for each problem distribution and downstream task, which is computationally expensive and lacks scalability and generalization.Method: Unsupervised pre-training of vector-quantized graph autoencoder on large collection of MIP instances without relying on solvers or optimal solutions, using discrete code assignments as vocabulary for optimization representations.
Result: Forge embeddings effectively cluster unseen instances across domains and sizes in unsupervised setting. In supervised tasks, fine-tuned embeddings improve solver performance for cut-generation and variable hints, outperforming state-of-the-art methods.
Conclusion: Forge provides a foundational approach for optimization representation learning that generalizes well across different problems and tasks, enabling more scalable and efficient combinatorial optimization.
Abstract: Combinatorial optimization problems are ubiquitous in science and engineering. Still, learning-based approaches to accelerate combinatorial optimization often require solving a large number of difficult instances to collect training data, incurring significant computational cost. Existing learning-based methods require training dedicated models for each problem distribution and each downstream task, severely limiting their scalability and generalization. We introduce Forge: Foundational Optimization Representations from Graph Embeddings, a framework that pre-trains a vector-quantized graph autoencoder on a large, diverse collection of mixed-integer programming (MIP) instances in an unsupervised manner, without relying on optimization solvers or optimal solutions. Vector quantization produces discrete code assignments that serve as a vocabulary for representing optimization instances. We evaluate Forge in both unsupervised and supervised settings. In the unsupervised setting, Forge embeddings effectively cluster unseen instances across problem domains and sizes. In the supervised setting, we fine-tune Forge embeddings and show that a single pre-trained model helps predict both the integrality gap for cut-generation and variable hints for search guidance across multiple problem and size distributions. In both tasks, we improve the performance of a commercial optimization solver and outperform state-of-the-art learning-based methods. Finally, we open-source our training code, pre-trained Forge weights, and embeddings for multiple MIP distributions to foster further research in representation learning for optimization problems.
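The discrete code assignment at the heart of vector quantization is a nearest-neighbor lookup against a learned codebook; a minimal sketch with illustrative dimensions (not Forge’s actual encoder or codebook):

```python
import numpy as np

def vq_assign(z: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each latent vector to its nearest codebook entry.

    z:        (num_nodes, d) encoder outputs for one MIP instance graph
    codebook: (K, d) learned code vectors (the "vocabulary")
    returns:  (num_nodes,) discrete code indices
    """
    # Squared Euclidean distance between every latent and every code
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codes = vq_assign(rng.normal(size=(30, 16)), rng.normal(size=(64, 16)))
# A bag-of-codes histogram can then serve as an instance-level embedding.
hist = np.bincount(codes, minlength=64)
```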
[484] Fisher information flow in artificial neural networks
Maximilian Weimar, Lukas M. Rachbauer, Ilya Starshynov, Daniele Faccio, Linara Adilova, Dorian Bouchet, Stefan Rotter
Main category: cs.LG
TL;DR: A method to monitor Fisher information flow through neural networks during parameter estimation, showing optimal performance corresponds to maximal Fisher information transmission and providing a model-free stopping criterion for training.
Details
Motivation: With neural networks becoming integral to measurement systems, understanding how they process parameter-relevant information internally is essential for optimizing estimation performance.Method: Presented a method to track Fisher information flow from input to output layers in ANNs performing parameter estimation tasks, using Fisher information as a metric to monitor information transmission.
Result: Showed that optimal estimation performance corresponds to maximal Fisher information transmission, and training beyond this point causes information loss due to overfitting.
Conclusion: The approach provides a model-free stopping criterion for network training that eliminates the need for validation datasets, demonstrated effective in realistic physical settings like imaging experiments.
Abstract: The estimation of continuous parameters from measured data plays a central role in many fields of physics. A key tool in understanding and improving such estimation processes is the concept of Fisher information, which quantifies how information about unknown parameters propagates through a physical system and determines the ultimate limits of precision. With Artificial Neural Networks (ANNs) gradually becoming an integral part of many measurement systems, it is essential to understand how they process and transmit parameter-relevant information internally. Here, we present a method to monitor the flow of Fisher information through an ANN performing a parameter estimation task, tracking it from the input to the output layer. We show that optimal estimation performance corresponds to the maximal transmission of Fisher information, and that training beyond this point results in information loss due to overfitting. This provides a model-free stopping criterion for network training, eliminating the need for a separate validation dataset. To demonstrate the practical relevance of our approach, we apply it to a network trained on data from an imaging experiment, highlighting its effectiveness in a realistic physical setting.
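As background (these are standard definitions, not the paper’s estimator), Fisher information and the precision limit it sets are:

```latex
% Fisher information of an observable x about a parameter \theta, and the
% Cramér-Rao bound it implies for any unbiased estimator \hat{\theta}:
\[
  I(\theta) = \mathbb{E}_{x \sim p(x\mid\theta)}
  \left[ \left( \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right)^{2} \right],
  \qquad
  \operatorname{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}.
\]
% The data-processing inequality guarantees that each network layer can only
% preserve or lose Fisher information, which is what makes layer-wise
% tracking a meaningful diagnostic.
```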
[485] Improving Monte Carlo Tree Search for Symbolic Regression
Zhengyao Huang, Daniel Zhengyu Huang, Tiannan Xiao, Dina Ma, Zhenyu Ming, Hao Shi, Yuanhui Wen
Main category: cs.LG
TL;DR: An improved MCTS framework for symbolic regression with extreme bandit allocation and evolution-inspired state-jumping actions that achieves competitive performance with state-of-the-art methods.
Details
Motivation: Traditional MCTS approaches for symbolic regression have limitations in bandit strategies and sequential symbol construction, which restrict their performance in discovering optimal mathematical expressions.Method: Proposes two key innovations: (1) extreme bandit allocation strategy with finite-time performance guarantees under polynomial reward decay, and (2) evolution-inspired state-jumping actions (mutation and crossover) that enable non-local transitions and reshape the reward landscape.
Result: The approach achieves competitive performance with state-of-the-art symbolic regression libraries in terms of recovery rate and obtains favorable positions on the Pareto frontier of accuracy versus model complexity across various datasets.
Conclusion: The improved MCTS framework successfully addresses limitations of traditional approaches through novel bandit strategies and evolutionary operators, demonstrating enhanced robustness and efficiency in symbolic expression discovery.
Abstract: Symbolic regression aims to discover concise, interpretable mathematical expressions that satisfy desired objectives, such as fitting data, posing a highly combinatorial optimization problem. While genetic programming has been the dominant approach, recent efforts have explored reinforcement learning methods for improving search efficiency. Monte Carlo Tree Search (MCTS), with its ability to balance exploration and exploitation through guided search, has emerged as a promising technique for symbolic expression discovery. However, its traditional bandit strategies and sequential symbol construction often limit performance. In this work, we propose an improved MCTS framework for symbolic regression that addresses these limitations through two key innovations: (1) an extreme bandit allocation strategy tailored for identifying globally optimal expressions, with finite-time performance guarantees under polynomial reward decay assumptions; and (2) evolution-inspired state-jumping actions such as mutation and crossover, which enable non-local transitions to promising regions of the search space. These state-jumping actions also reshape the reward landscape during the search process, improving both robustness and efficiency. We conduct a thorough numerical study of the impact of these improvements and benchmark our approach against existing symbolic regression methods on a variety of datasets, including both ground-truth and black-box datasets. Our approach achieves competitive performance with state-of-the-art libraries in terms of recovery rate and attains favorable positions on the Pareto frontier of accuracy versus model complexity. Code is available at https://github.com/PKU-CMEGroup/MCTS-4-SR.
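To make the extreme-bandit idea concrete: unlike standard UCT, which ranks children by mean reward, an extreme-bandit rule ranks them by the best reward observed, since symbolic regression cares about the single best expression. A minimal sketch of this selection step (the paper’s actual allocation rule and its finite-time guarantees are more refined):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    max_reward: float = 0.0
    children: list = field(default_factory=list)

def select_child_extreme(node: Node, c: float = 1.0) -> Node:
    """Extreme-bandit selection: score children by the best reward seen so
    far (not the mean), with a UCT-style exploration bonus."""
    def score(child: Node) -> float:
        bonus = c * math.sqrt(math.log(node.visits + 1) / (child.visits + 1))
        return child.max_reward + bonus
    return max(node.children, key=score)

root = Node(visits=10, children=[Node(visits=4, max_reward=0.9),
                                 Node(visits=6, max_reward=0.7)])
best = select_child_extreme(root)  # picks the child with the higher max reward
```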
[486] Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features
Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun
Main category: cs.LG
TL;DR: CAPE is a novel positional encoding method that identifies causal structure in non-sequential features using DAGs and embeds them in hyperbolic space to create causality-aware positional encodings for transformers.
Details
Motivation: Existing positional encoding methods require predefined token order, making them unsuitable for real-world data with non-sequential but causally-related features.Method: CAPE identifies causal structure as weighted DAGs using generalized structural equation modeling, embeds DAGs in hyperbolic space to preserve geometric structure, and converts them into rotary positional encodings for transformer integration.
Result: Theoretical analysis shows CAPE encodings possess valuable properties (causal distance attenuation, causal generality attenuation, robustness to disturbances). Empirical evaluation on synthetic and real datasets demonstrates effectiveness.
Conclusion: CAPE effectively enhances transformers for data with non-sequential features by providing causality-aware positional encodings that capture important causal graph properties.
Abstract: Positional encoding is essential for supplementing transformer with positional information of tokens. Existing positional encoding methods demand predefined token/feature order, rendering them unsuitable for real-world data with non-sequential yet causally-related features. To address this limitation, we propose CAPE, a novel method that identifies underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG) using generalized structural equation modeling. The DAG is then embedded in hyperbolic space where its geometric structure is well-preserved using a hyperboloid model-based approach that effectively captures two important causal graph properties (causal strength & causal specificity). This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integrating with transformer’s self-attention mechanism. Theoretical analysis reveals that CAPE-generated rotary positional encodings possess three valuable properties for enhanced self-attention, including causal distance-induced attenuation, causal generality-induced attenuation, and robustness to positional disturbances. We evaluate CAPE over both synthetic and real-world datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformer for data with non-sequential features. Our code is available at https://github.com/Catchxu/CAPE.
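For orientation, the final conversion step applies a standard rotary encoding to feature vectors given scalar per-feature "positions"; in CAPE those positions would come from the hyperbolic DAG embedding, but the rotation itself is the usual RoPE mechanics. A minimal sketch (positions here are just given, not derived from a DAG):

```python
import numpy as np

def rotary_encode(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by angles pos * inv_freq,
    the standard RoPE construction."""
    n, d = x.shape                                # d must be even
    inv_freq = base ** (-np.arange(0, d, 2) / d)  # (d/2,)
    ang = pos[:, None] * inv_freq[None, :]        # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

v = rotary_encode(np.ones((3, 8)), np.array([0.0, 1.0, 2.0]))
```

The useful property is that attention inner products then depend only on position differences, which is what lets causal "distance" modulate attention.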
[487] A Generative Conditional Distribution Equality Testing Framework and Its Minimax Analysis
Siming Zheng, Meifang Lan, Tong Wang, Yuanyuan Lin
Main category: cs.LG
TL;DR: A framework for testing equality of conditional distributions in two-sample problems using neural network-based generative methods and sample splitting, with theoretical guarantees and empirical validation.
Details
Motivation: Addressing the need for robust testing methods in transfer learning under covariate shift, particularly for comparing conditional distributions.Method: Transforms conditional distribution testing into unconditional testing using neural network-based generative methods and sample splitting. Introduces two tests: generative permutation-based and generative classification accuracy-based conditional distribution equality tests.
Result: Establishes minimax lower bounds, shows tests can attain optimal rates, proves testing consistency, and demonstrates effectiveness on synthetic and real-world datasets.
Conclusion: The proposed framework provides statistically sound and practically effective methods for conditional distribution testing with strong theoretical guarantees.
Abstract: In this paper, we propose a general framework for testing the equality of the conditional distributions in a two-sample problem. This problem is most relevant to transfer learning under covariate shift. Our framework is built on neural network-based generative methods and sample splitting techniques by transforming the conditional distribution testing problem into an unconditional one. We introduce two special tests: the generative permutation-based conditional distribution equality test and the generative classification accuracy-based conditional distribution equality test. Theoretically, we establish a minimax lower bound for statistical inference in testing the equality of two conditional distributions under certain smoothness conditions. We demonstrate that the generative permutation-based conditional distribution equality test and its modified version can attain this lower bound precisely or up to some iterated logarithmic factor. Moreover, we prove the testing consistency of the generative classification accuracy-based conditional distribution equality test. We also establish the convergence rate for the learned conditional generator by deriving new results related to the recently-developed offset Rademacher complexity and approximation properties using neural networks. Empirically, we conduct numerical studies including synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach.
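Since the framework reduces conditional testing to an unconditional two-sample problem, the classification-accuracy test has a familiar unconditional analogue; a minimal sketch, assuming scikit-learn and SciPy are available (this is the generic classifier two-sample test, not the paper’s generative construction):

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def classifier_two_sample_test(x0, x1, seed=0):
    """If the two samples share a distribution, held-out classification
    accuracy should not beat chance; test it with a one-sided binomial test."""
    X = np.vstack([x0, x1])
    y = np.r_[np.zeros(len(x0)), np.ones(len(x1))]
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=seed)
    acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    n = len(yte)
    k = int(round(acc * n))
    return acc, binomtest(k, n, p=0.5, alternative="greater").pvalue

rng = np.random.default_rng(0)
acc, p = classifier_two_sample_test(rng.normal(0, 1, (200, 5)),
                                    rng.normal(0.5, 1, (200, 5)))
```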
[488] Explicit Path CGR: Maintaining Sequence Fidelity in Geometric Representations
Sarwan Ali
Main category: cs.LG
TL;DR: A novel Chaos Game Representation method called R-CGR that preserves complete sequence information through explicit path encoding and rational arithmetic, enabling perfect sequence reconstruction from geometric traces.
Details
Motivation: Traditional CGR approaches lose sequence information during geometric mapping, limiting their utility in bioinformatics applications where both accuracy and sequence recovery are essential.Method: Introduces complete sequence recovery through explicit path encoding combined with rational arithmetic precision control, maintaining both positional and character information at each step for perfect reversibility.
Result: Demonstrated effectiveness on biological sequence classification tasks, achieving competitive performance compared to traditional sequence-based methods while providing interpretable geometric visualizations.
Conclusion: R-CGR opens new avenues for interpretable bioinformatics analysis by generating feature-rich images suitable for deep learning while maintaining complete sequence information through explicit encoding.
Abstract: We present a novel information-preserving Chaos Game Representation (CGR) method, also called Reverse-CGR (R-CGR), for biological sequence analysis that addresses the fundamental limitation of traditional CGR approaches: the loss of sequence information during geometric mapping. Our method introduces complete sequence recovery through explicit path encoding combined with rational arithmetic precision control, enabling perfect sequence reconstruction from stored geometric traces. Unlike purely geometric approaches, our reversibility is achieved through comprehensive path storage that maintains both positional and character information at each step. We demonstrate the effectiveness of R-CGR on biological sequence classification tasks, achieving competitive performance compared to traditional sequence-based methods while providing interpretable geometric visualizations. The approach generates feature-rich images suitable for deep learning while maintaining complete sequence information through explicit encoding, opening new avenues for interpretable bioinformatics analysis where both accuracy and sequence recovery are essential.
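A simplified view of the idea: classic CGR maps each nucleotide to a corner of the unit square and moves to the midpoint toward that corner; storing the visited corner at every step is what makes the mapping trivially reversible. The sketch below captures only that path-storage mechanism (the paper additionally uses rational arithmetic to avoid floating-point drift):

```python
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_with_path(seq: str):
    """Chaos Game Representation with an explicit per-step path trace."""
    x, y = 0.5, 0.5
    points, path = [], []
    for ch in seq:
        cx, cy = CORNERS[ch]
        x, y = (x + cx) / 2, (y + cy) / 2  # midpoint toward the corner
        points.append((x, y))
        path.append(ch)                    # explicit character record
    return points, path

def reconstruct(path):
    """Perfect recovery: the stored path *is* the sequence."""
    return "".join(path)

pts, path = cgr_with_path("ACGTGCA")
assert reconstruct(path) == "ACGTGCA"
```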
[489] DS-Diffusion: Data Style-Guided Diffusion Model for Time-Series Generation
Mingchun Sun, Rongqiang Zhao, Hengrui Hu, Songyu Ding, Jie Liu
Main category: cs.LG
TL;DR: DS-Diffusion is a novel time series generation model that uses style-guided kernels to avoid retraining for specific conditions, reduces distributional bias through hierarchical denoising, and provides interpretable inference.
Details
Motivation: Existing diffusion models for time series generation require full retraining for conditional guidance, suffer from distributional bias between generated and real data, and have uninterpretable inference processes.Method: Proposes DS-Diffusion with: 1) diffusion framework based on style-guided kernels to avoid retraining, 2) time-information based hierarchical denoising (THD) mechanism to reduce distributional bias, and 3) clear style indication for interpretability.
Result: Compared to state-of-the-art ImagenTime: predictive score decreases by 5.56%, discriminative score decreases by 61.55%. Distributional bias is reduced and inference process is more interpretable.
Conclusion: DS-Diffusion enhances flexibility and adaptability by eliminating retraining needs, reduces distributional bias, and provides more interpretable time series generation.
Abstract: Diffusion models are the mainstream approach for time series generation tasks. However, existing diffusion models for time series generation require retraining the entire framework to introduce specific conditional guidance. There also exists a certain degree of distributional bias between the generated data and the real data, which leads to potential model biases in downstream tasks. Additionally, the complexity of diffusion models and the latent spaces leads to an uninterpretable inference process. To address these issues, we propose the data style-guided diffusion model (DS-Diffusion). In the DS-Diffusion, a diffusion framework based on style-guided kernels is developed to avoid retraining for specific conditions. The time-information based hierarchical denoising mechanism (THD) is developed to reduce the distributional bias between the generated data and the real data. Furthermore, the generated samples can clearly indicate the data style from which they originate. We conduct comprehensive evaluations using multiple public datasets to validate our approach. Experimental results show that, compared to the state-of-the-art model such as ImagenTime, the predictive score and the discriminative score decrease by 5.56% and 61.55%, respectively. The distributional bias between the generated data and the real data is further reduced, the inference process is also more interpretable. Moreover, by eliminating the need to retrain the diffusion model, the flexibility and adaptability of the model to specific conditions are also enhanced.
[490] Improving Credit Card Fraud Detection through Transformer-Enhanced GAN Oversampling
Kashaf Ul Emaan
Main category: cs.LG
TL;DR: A hybrid GAN-Transformer approach for generating realistic fraudulent transaction samples to address class imbalance in credit card fraud detection, outperforming traditional methods like SMOTE and other generative models.
Details
Motivation: Credit card fraud detection faces severe class imbalance issues where fraud cases are extremely rare. Traditional oversampling methods like SMOTE create simplistic synthetic samples, while recent generative models (CTGAN, TVAE) still struggle with high-dimensional dependence modeling.Method: Proposes a hybrid approach combining Generative Adversarial Network (GAN) with Transformer encoder blocks. The GAN enables adversarial training for realistic sample generation, while the Transformer’s self-attention mechanism learns rich feature interactions to overcome limitations of existing methods.
Result: Tested on the Credit Card Fraud Detection dataset, the Transformer-based GAN showed substantial improvements in Recall, F1-score, and AUC compared to conventional and generative resampling strategies across multiple classifiers (Logistic Regression, Random Forest, XGBoost, SVM).
Conclusion: The hybrid GAN-Transformer approach effectively overcomes severe class imbalance in fraud detection by producing high-quality synthetic minority class samples, demonstrating superior performance over existing methods.
Abstract: Detection of credit card fraud is an acute issue of financial security because transaction datasets are highly lopsided, with fraud cases being only a drop in the ocean. Balancing datasets with the most popular traditional oversampling method, the Synthetic Minority Oversampling Technique (SMOTE), generally creates simplistic synthetic samples that are not readily applicable to complex fraud patterns. Recent advances such as Conditional Tabular Generative Adversarial Networks (CTGAN) and Tabular Variational Autoencoders (TVAE) have demonstrated increased efficiency in tabular synthesis, yet these models still exhibit issues with high-dimensional dependence modelling. We present a hybrid approach that uses a Generative Adversarial Network (GAN) with a Transformer encoder block to produce realistic fraudulent transaction samples. The GAN architecture allows the generator to be trained adversarially, and the Transformer allows the model to learn rich feature interactions through self-attention. This hybrid strategy overcomes the limitations of SMOTE, CTGAN, and TVAE by producing varied, high-quality synthetic minority-class samples. We test our algorithm on the publicly available Credit Card Fraud Detection dataset and compare it to conventional and generative resampling strategies with a variety of classifiers, such as Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Findings indicate that our Transformer-based GAN shows substantial gains in Recall, F1-score, and Area Under the Receiver Operating Characteristic Curve (AUC), demonstrating its effectiveness in overcoming the severe class imbalance inherent in fraud detection.
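To illustrate the architectural idea, here is a minimal sketch of a GAN generator that treats each tabular feature as a token and runs a Transformer encoder over the tokens; layer sizes and the feature tokenization are illustrative assumptions, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class TransformerGenerator(nn.Module):
    """Generator sketch: noise -> per-feature tokens -> self-attention ->
    one scalar per synthetic feature (a synthetic transaction row)."""
    def __init__(self, noise_dim=64, n_features=30, d_model=64, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(noise_dim, n_features * d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(d_model, 1)
        self.n_features, self.d_model = n_features, d_model

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.proj(z).view(-1, self.n_features, self.d_model)
        h = self.encoder(h)             # self-attention over feature tokens
        return self.out(h).squeeze(-1)  # (batch, n_features)

fake = TransformerGenerator()(torch.randn(8, 64))  # 8 synthetic transactions
```

Adversarial training against a discriminator then proceeds as in a standard GAN; the attention layers are what let the generator capture high-dimensional feature dependencies.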
[491] Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws
Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu
Main category: cs.LG
TL;DR: This paper introduces the Functional Scaling Law (FSL) framework to model loss dynamics during LLM training, capturing the impact of learning rate schedules through a teacher-student kernel regression setup and SDE modeling.
Details
Motivation: Existing scaling laws focus only on final-step loss, ignoring training dynamics and learning rate schedule effects. The authors aim to bridge this gap by studying how different learning rate schedules affect training efficiency and loss evolution.Method: Uses a teacher-student kernel regression setup trained via online SGD, employing intrinsic time viewpoint and SDE modeling to derive the Functional Scaling Law that characterizes population risk evolution for general learning rate schedules.
Result: FSL captures learning rate schedule effects through an explicit convolution-type functional term. The framework theoretically justifies empirical practices like higher-capacity models being more efficient, learning rate decay improving efficiency, and warmup-stable-decay schedules outperforming direct-decay.
Conclusion: FSL provides a comprehensive framework for understanding LLM pre-training dynamics, offering practical utility for fitting, predicting, and optimizing loss curves across model sizes from 0.1B to 1B parameters.
Abstract: Scaling laws have played a cornerstone role in guiding the training of large language models (LLMs). However, most existing works on scaling laws primarily focus on the final-step loss, overlooking the loss dynamics during the training process and, crucially, the impact of learning rate schedule (LRS). In this paper, we aim to bridge this gap by studying a teacher-student kernel regression setup trained via online stochastic gradient descent (SGD). Leveraging a novel intrinsic time viewpoint and stochastic differential equation (SDE) modeling of SGD, we introduce the Functional Scaling Law (FSL), which characterizes the evolution of population risk during the training process for general LRSs. Remarkably, the impact of the LRSs is captured through an explicit convolution-type functional term, making their effects fully tractable. To illustrate the utility of FSL, we analyze three widely used LRSs – constant, exponential decay, and warmup-stable-decay (WSD) – under both data-limited and compute-limited regimes. We provide theoretical justification for widely adopted empirical practices in LLMs pre-training such as (i) higher-capacity models are more data- and compute-efficient; (ii) learning rate decay can improve training efficiency; (iii) WSD-like schedules can outperform direct-decay schedules. Lastly, we explore the practical relevance of FSL as a surrogate model for fitting, predicting and optimizing the loss curves in LLM pre-training, with experiments conducted across model sizes ranging from 0.1B to 1B parameters. We hope our FSL framework can deepen the understanding of LLM pre-training dynamics and provide insights for improving large-scale model training.
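Of the three schedules analyzed, warmup-stable-decay is the least standard; a minimal sketch of its shape (phase fractions and the linear decay tail are illustrative choices, not values from the paper):

```python
def wsd_lr(step: int, total: int, peak: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-stable-decay (WSD): linear warmup, a long constant phase,
    then a decay tail to zero."""
    warmup = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:                           # warmup
        return peak * step / max(warmup, 1)
    if step < decay_start:                      # stable
        return peak
    return peak * (total - step) / max(total - decay_start, 1)  # decay

lrs = [wsd_lr(s, total=10_000) for s in range(10_000)]
```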
cs.MA
[492] The Heterogeneous Multi-Agent Challenge
Charles Dansereau, Junior-Samuel Lopez-Yepez, Karthik Soma, Antoine Fagette
Main category: cs.MA
TL;DR: The paper identifies a lack of standardized testbeds for cooperative Heterogeneous Multi-Agent Reinforcement Learning (HeMARL), which hinders progress in this underexplored but important research area.
Details
Motivation: Heterogeneous Multi-Agent Reinforcement Learning (HeMARL) addresses real-world scenarios where agents have different capabilities, but current research lacks standardized benchmarks like those available for homogeneous MARL and single-agent RL.Method: The paper proposes the need for establishing standardized environments and benchmarks specifically designed for cooperative HeMARL to enable proper evaluation and comparison of algorithms.
Result: The analysis reveals that current HeMARL research often uses overly simple environments or weakly heterogeneous settings where most algorithms perform optimally, limiting meaningful progress measurement.
Conclusion: There is a critical need to develop standardized testbeds for cooperative HeMARL to advance research in this important domain and enable proper benchmarking of heterogeneous agent algorithms.
Abstract: Multi-Agent Reinforcement Learning (MARL) is a growing research area that has gained significant traction in recent years, extending Deep RL applications to a much wider range of problems. A particularly challenging class of problems in this domain is Heterogeneous Multi-Agent Reinforcement Learning (HeMARL), where agents with different sensors, resources, or capabilities must cooperate based on local information. The large number of real-world situations involving heterogeneous agents makes this an attractive yet underexplored research area, as most MARL research focuses on homogeneous agents (e.g., a swarm of identical robots). In MARL and single-agent RL, standardized environments such as ALE and SMAC have made it possible to establish recognized benchmarks for measuring progress. However, there is a clear lack of such a standardized testbed for cooperative HeMARL. As a result, new research in this field often uses simple environments, where most algorithms perform near optimally, or weakly heterogeneous MARL environments.
[493] Knowledge Base-Aware Orchestration: A Dynamic, Privacy-Preserving Method for Multi-Agent Systems
Danilo Trombino, Vincenzo Pecorella, Alessandro de Giulii, Davide Tresoldi
Main category: cs.MA
TL;DR: KBA Orchestration enhances multi-agent systems by using dynamic, privacy-preserving relevance signals from agents’ knowledge bases to improve task routing accuracy and efficiency.
Details
Motivation: Static agent descriptions in multi-agent systems become outdated and incomplete, leading to inefficient task routing in dynamic environments where agent capabilities evolve continuously.Method: Introduces Knowledge Base-Aware (KBA) Orchestration that augments static descriptions with dynamic relevance signals. When static descriptions are insufficient, the orchestrator prompts subagents in parallel to assess task relevance against their private KBs, returning lightweight ACK signals without exposing underlying data.
Result: Benchmarks show KBA Orchestration significantly outperforms static description-driven methods in routing precision and overall system efficiency.
Conclusion: The method achieves more accurate and adaptive task routing while preserving agent autonomy and data confidentiality, making it suitable for large-scale systems requiring higher accuracy than standard description-driven routing.
Abstract: Multi-agent systems (MAS) are increasingly tasked with solving complex, knowledge-intensive problems where effective agent orchestration is critical. Conventional orchestration methods rely on static agent descriptions, which often become outdated or incomplete. This limitation leads to inefficient task routing, particularly in dynamic environments where agent capabilities continuously evolve. We introduce Knowledge Base-Aware (KBA) Orchestration, a novel approach that augments static descriptions with dynamic, privacy-preserving relevance signals derived from each agent’s internal knowledge base (KB). In the proposed framework, when static descriptions are insufficient for a clear routing decision, the orchestrator prompts the subagents in parallel. Each agent then assesses the task’s relevance against its private KB, returning a lightweight ACK signal without exposing the underlying data. These collected signals populate a shared semantic cache, providing dynamic indicators of agent suitability for future queries. By combining this novel mechanism with static descriptions, our method achieves more accurate and adaptive task routing preserving agent autonomy and data confidentiality. Benchmarks show that our KBA Orchestration significantly outperforms static description-driven methods in routing precision and overall system efficiency, making it suitable for large-scale systems that require higher accuracy than standard description-driven routing.
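The routing flow described above (parallel KB probes, lightweight ACKs, a shared semantic cache) can be sketched in a few lines; all names here are hypothetical stand-ins, and the static-description scoring step that precedes the probes is elided:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    kb_topics: set  # stand-in for a private knowledge base

    def relevance_ack(self, task: str) -> bool:
        """Lightweight ACK: report relevance without exposing KB contents."""
        return any(topic in task.lower() for topic in self.kb_topics)

class KBAOrchestrator:
    def __init__(self, agents):
        self.agents = agents
        self.semantic_cache = {}  # task -> names of agents that ACKed

    def route(self, task: str):
        if task not in self.semantic_cache:
            with ThreadPoolExecutor() as pool:  # probe all agents in parallel
                acks = list(pool.map(lambda a: (a.name, a.relevance_ack(task)),
                                     self.agents))
            self.semantic_cache[task] = [n for n, ack in acks if ack]
        names = self.semantic_cache[task]
        return names[0] if names else None

orch = KBAOrchestrator([Agent("billing", {"invoice", "refund"}),
                        Agent("devops", {"deploy", "incident"})])
assert orch.route("handle refund request") == "billing"
```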
[494] Optimal Multi-agent Path Finding in Continuous Time
Alvin Combrink, Sabino Francesco Roselli, Martin Fabian
Main category: cs.MA
TL;DR: This paper presents an analytical framework for CCBS algorithms in continuous-time multi-agent path finding, identifies flaws in the reference implementation, and introduces a new branching rule (δ-BR) that restores soundness, optimality, and termination guarantees.
Details
Motivation: The reference CCBS implementation can fail to terminate on solvable problems and return sub-optimal solutions, despite CCBS being viewed as the standard optimal baseline for continuous-time multi-agent path finding.Method: The authors develop an analytical framework with sufficient conditions for CCBS-style algorithms, then introduce a new branching rule (δ-BR) that satisfies these conditions. The framework provides systematic evaluation tools for CCBS-like solvers.
Result: CCBS with δ-BR improves solution quality (16% lower sum-of-costs in one example) and guarantees termination and optimality, while the reference CCBS is faster but occasionally sub-optimal and may not terminate.
Conclusion: The δ-BR branching rule can be adopted as a drop-in replacement in existing codebases, and the analytical framework provides tools for rigorous analysis of next-generation MAPFR algorithms.
Abstract: Continuous-time Conflict Based-Search (CCBS) has long been viewed as the standard optimal baseline for multi-agent path finding in continuous time (MAPFR), yet recent critiques show that the theoretically described CCBS can fail to terminate on solvable MAPFR problems while the publicly available reference implementation can return sub-optimal solutions. This work presents an analytical framework that yields simple and sufficient conditions under which any CCBS-style algorithm is both sound and solution complete. Investigating the reference CCBS implementation reveals that it violates our sufficient conditions for soundness, with counterexamples demonstrating sub-optimality. Leveraging the framework, we introduce a branching rule ($\delta$-BR) and prove it restores soundness and termination guarantees. Consequently, the resulting CCBS variant is both sound and solution complete. To our knowledge, this is the first MAPFR solver matching the guarantees of the discrete-time CBS. On a constructed example, CCBS with $\delta$-BR improves sum-of-costs from 10.707 to 9.000 ($\approx$ 16% lower) compared to the reference CCBS implementation. Across benchmarks, the reference CCBS implementation is generally able to find solutions faster than CCBS with $\delta$-BR due to its more aggressive pruning. However, this comes at the cost of occasional sub-optimality and potential non-termination when all solutions are pruned, whereas $\delta$-BR preserves optimality and guarantees termination by design. Because $\delta$-BR largely only affects the branching step, it can be adopted as a drop-in replacement in existing codebases. Beyond CCBS, the analytical framework and termination criterion provide a systematic way to evaluate other CCBS-like MAPFR solvers and future extensions, thereby offering tools for rigorous analysis of next-generation MAPFR algorithms.
[495] PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization
Dawei Xiang, Wenyan Xu, Kexin Chu, Tianqi Ding, Zixu Shen, Yiming Zeng, Jianchang Su, Wei Zhang
Main category: cs.MA
TL;DR: PromptSculptor is a multi-agent framework that automates iterative prompt optimization for Text-to-Image models, transforming vague user prompts into comprehensive prompts through collaborative specialized agents.
Details
Motivation: Current Text-to-Image models require detailed prompts and multiple refinement rounds to generate high-quality images, which is time-consuming and requires expertise.Method: A four-agent framework using Chain-of-Thought reasoning: specialized agents collaboratively transform vague prompts into refined ones, with self-evaluation and feedback-tuning agents for iterative refinement.
Result: Experimental results show PromptSculptor significantly enhances output quality, reduces iterations needed for user satisfaction, and works seamlessly with various T2I models.
Conclusion: The model-agnostic framework enables efficient prompt optimization, paving the way for industrial applications by democratizing access to high-quality image generation.
Abstract: The rapid advancement of generative AI has democratized access to powerful tools such as Text-to-Image models. However, to generate high-quality images, users must still craft detailed prompts specifying scene, style, and context-often through multiple rounds of refinement. We propose PromptSculptor, a novel multi-agent framework that automates this iterative prompt optimization process. Our system decomposes the task into four specialized agents that work collaboratively to transform a short, vague user prompt into a comprehensive, refined prompt. By leveraging Chain-of-Thought reasoning, our framework effectively infers hidden context and enriches scene and background details. To iteratively refine the prompt, a self-evaluation agent aligns the modified prompt with the original input, while a feedback-tuning agent incorporates user feedback for further refinement. Experimental results demonstrate that PromptSculptor significantly enhances output quality and reduces the number of iterations needed for user satisfaction. Moreover, its model-agnostic design allows seamless integration with various T2I models, paving the way for industrial applications.
cs.MM
[496] MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
Jianxuan Yang, Xiaoran Yang, Lipan Zhang, Xinyue Guo, Zhao Wang, Gongping Huang
Main category: cs.MM
TL;DR: MultiSoundGen is a novel V2A framework that addresses challenges in complex multi-event scenarios through SlowFast Contrastive AVP for semantic-temporal alignment and AVP-RPO for preference optimization, achieving SOTA performance.
Details
Motivation: Current V2A methods struggle with precise semantic-temporal alignment and lack quantitative optimization for complex multi-event scenarios involving multiple sound sources and transitions.Method: Proposes MultiSoundGen with two innovations: 1) SlowFast Contrastive AVP (SF-CAVP) - a dual-stream architecture for aligning semantic representations and dynamic features; 2) AVP-Ranked Preference Optimization (AVP-RPO) using SF-CAVP as reward model for semantic-temporal matching and audio quality enhancement.
Result: MultiSoundGen achieves state-of-the-art performance in multi-event scenarios with comprehensive gains in distribution matching, audio quality, semantic alignment, and temporal synchronization.
Conclusion: The framework successfully addresses core limitations of existing V2A methods through innovative AVP and preference optimization techniques, demonstrating superior performance in complex multi-event sound generation.
Abstract: Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information together with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality. As a result, it fails to enhance integrated generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations: the first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture. SF-CAVP explicitly aligns core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity; second, we integrate the DPO method into the V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. The complete code and dataset will be released soon.
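For reference, the standard DPO objective that AVP-RPO builds on takes the following form; this is the generic DPO loss, not the paper’s exact variant:

```latex
% y_w / y_l are preferred / dispreferred samples (here, ranked by the
% SF-CAVP reward model), \pi_\theta is the trained generator and
% \pi_{\mathrm{ref}} a frozen reference policy:
\[
  \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right).
\]
```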
[497] Comparative Study of Subjective Video Quality Assessment Test Methods in Crowdsourcing for Varied Use Cases
Babak Naderi, Ross Cutler
Main category: cs.MM
TL;DR: This paper compares three subjective video quality assessment methods (ACR, ACR-HR, and CCR) across multiple studies, finding that ACR-HR is most cost-effective but CCR is more sensitive to quality improvements beyond the reference.
Details
Motivation: To provide practical guidance on choosing between Absolute Category Rating (ACR), ACR with Hidden Reference (ACR-HR), and Comparison Category Rating (CCR) methods for crowdsourced subjective video quality assessment.Method: Conducted P.910-compliant side-by-side comparison across six studies using 15 talking-head sources with realistic degradations (blur, scaling, compression, freezing) and bitrate-ladder tasks at 720p and 1080p resolutions.
Result: ACR-HR and ACR correlate strongly at condition level, but CCR is more sensitive to improvements beyond reference quality. ACR-HR is approximately twice as fast and cost-effective with lower variability, but exhibits compressed scale use for fair-quality videos.
Conclusion: The choice of quality measurement method affects saturation points and bitrate-ladder recommendations. The paper provides practical guidance on when to use each test method based on specific assessment needs.
Abstract: In crowdsourced subjective video quality assessment, practitioners often face a choice between Absolute Category Rating (ACR), ACR with Hidden Reference (ACR-HR), and Comparison Category Rating (CCR). We conducted a P.910-compliant, side-by-side comparison across six studies using 15 talking-head sources of good and fair quality, processed with realistic degradations (blur, scaling, compression, freezing, and their combinations), as well as a practical bitrate-ladder task at 720p and 1080p resolutions. We evaluated statistical efficiency (standard deviations), economic efficiency, and decision agreement. Our results show that ACR-HR and ACR correlate strongly at the condition level, while CCR is more sensitive, capturing improvements beyond the reference. ACR-HR, however, exhibits compressed scale use, particularly for videos with fair source quality. ACR-HR is approximately twice as fast and cost-effective, with lower normalized variability, yet the choice of quality measurement method shifts saturation points and bitrate-ladder recommendations. Finally, we provide practical guidance on when to use each test method.
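As background on why ACR-HR cannot register improvements beyond the reference: in the ITU-T P.910 ACR-HR procedure, each processed sequence (PVS) is scored against its hidden reference (REF) via a per-subject differential score on the 5-point scale:

```latex
\[
  \mathrm{DV} = V(\mathrm{PVS}) - V(\mathrm{REF}) + 5,
\]
% DV saturates at 5 whenever the processed video is rated at or above its
% reference; CCR's two-sided comparison scale has no such ceiling, which is
% one source of its greater sensitivity noted above.
```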
[498] InconVAD: A Two-Stage Dual-Tower Framework for Multimodal Emotion Inconsistency Detection
Zongyi Li, Junchuan Zhao, Francis Bu Sung Lee, Andrew Zi Han Yee
Main category: cs.MM
TL;DR: InconVAD is a two-stage framework for detecting emotional inconsistency across speech and text modalities using Valence/Arousal/Dominance space, with uncertainty-aware unimodal predictions and selective fusion of consistent signals.
Details
Motivation: Existing approaches rely on incomplete emotion representations and unconditional fusion, which weakens performance when modalities are inconsistent. Little prior work explicitly addresses inconsistency detection itself.Method: Two-stage framework: 1) Independent uncertainty-aware models yield robust unimodal predictions, 2) A classifier identifies cross-modal inconsistency and selectively integrates consistent signals using VAD space.
Result: Extensive experiments show that InconVAD surpasses existing methods in both multimodal emotion inconsistency detection and modeling.
Conclusion: InconVAD offers a more reliable and interpretable solution for emotion analysis by effectively handling modality inconsistencies.
Abstract: Detecting emotional inconsistency across modalities is a key challenge in affective computing, as speech and text often convey conflicting cues. Existing approaches generally rely on incomplete emotion representations and employ unconditional fusion, which weakens performance when modalities are inconsistent. Moreover, little prior work explicitly addresses inconsistency detection itself. We propose InconVAD, a two-stage framework grounded in the Valence/Arousal/Dominance (VAD) space. In the first stage, independent uncertainty-aware models yield robust unimodal predictions. In the second stage, a classifier identifies cross-modal inconsistency and selectively integrates consistent signals. Extensive experiments show that InconVAD surpasses existing methods in both multimodal emotion inconsistency detection and modeling, offering a more reliable and interpretable solution for emotion analysis.
[499] Embedding Alignment in Code Generation for Audio
Sam Kouteili, Hiren Madhu, George Typaldos, Mark Santolucito
Main category: cs.MM
TL;DR: This paper investigates the relationship between code and audio embeddings to improve diversity in LLM-generated code for live-coding, proposing a model that predicts audio embeddings from code to enhance musical output variety.
Details
Motivation: LLMs struggle to generate diverse code candidates for creative coding like live-coding, lacking insight into audio output, which limits users' ability to realize musical intentions through varied code suggestions.Method: The study analyzes the topology between code and audio embedding spaces, constructs a predictive model to learn an embedding alignment map, and presents a model that predicts audio embeddings from code inputs.
Result: Findings show code and audio embeddings don’t have a simple linear relationship, but an embedding alignment map can be learned to better connect code candidates with their audio outputs.
Conclusion: The proposed code-audio embedding alignment model enables more musically diverse output by establishing a predictive relationship between code and audio, enhancing LLM-powered creative coding capabilities.
Abstract: LLM-powered code generation has the potential to revolutionize creative coding endeavors, such as live-coding, by enabling users to focus on structural motifs over syntactic details. In such domains, when prompting an LLM, users may benefit from considering multiple varied code candidates to better realize their musical intentions. Code generation models, however, struggle to present unique and diverse code candidates, with no direct insight into the code’s audio output. To better establish a relationship between code candidates and produced audio, we investigate the topology of the mapping between code and audio embedding spaces. We find that code and audio embeddings do not exhibit a simple linear relationship, and we supplement this finding with a predictive model showing that an embedding alignment map can be learned. In support of musically diverse output, we present a model that, given code, predicts the output audio embedding, constructing a code-audio embedding alignment map.
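A minimal sketch of learning such an alignment map: a small regressor from code embeddings to audio embeddings trained with a cosine-distance loss. Dimensions, architecture, and the loss choice are illustrative assumptions, not the paper’s model:

```python
import torch
import torch.nn as nn

code_dim, audio_dim = 768, 512
align = nn.Sequential(nn.Linear(code_dim, 1024), nn.GELU(),
                      nn.Linear(1024, audio_dim))
opt = torch.optim.Adam(align.parameters(), lr=1e-4)

def train_step(code_emb: torch.Tensor, audio_emb: torch.Tensor) -> float:
    """One step: predict the audio embedding from the code embedding."""
    pred = align(code_emb)
    loss = 1 - nn.functional.cosine_similarity(pred, audio_emb, dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = train_step(torch.randn(16, code_dim), torch.randn(16, audio_dim))
```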
[500] CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection
Jiaxun Yang, Yifei Han, Long Zhang, Yujie Liu, Bin Li, Bo Gao, Yangfan He, Kejia Zhan
Main category: cs.MM
TL;DR: This paper addresses the limitation in detecting Chinese Patronizing and Condescending Language (CPCL) by creating a new dataset PCLMMPLUS with 103k user comments and proposing CPCLDetector model with alignment selection and knowledge-enhanced modules, achieving state-of-the-art performance.
Details
Motivation: Existing CPCL datasets lack user comments, which are crucial for understanding video content and detecting CPCL videos accurately. This gap leads to failure in identifying some CPCL content.Method: The research reconstructs PCLMMPLUS dataset with 103k comment entries and proposes CPCLDetector model featuring alignment selection and knowledge-enhanced comment content modules.
Result: Extensive experiments show CPCLDetector outperforms state-of-the-art methods on PCLMM and achieves higher performance on PCLMMPLUS, enabling more accurate detection of CPCL videos.
Conclusion: The proposed approach supports content governance and protects vulnerable groups by improving CPCL video detection accuracy. Code and dataset are publicly available.
Abstract: Chinese Patronizing and Condescending Language (CPCL) is implicitly discriminatory toxic speech targeting vulnerable groups on Chinese video platforms. The existing dataset lacks user comments, which are a direct reflection of video content. This undermines the model’s understanding of video content and results in the failure to detect some CPCL videos. To make up for this loss, this research reconstructs a new dataset, PCLMMPLUS, that includes 103k comment entries and expands the dataset size. We also propose the CPCLDetector model with alignment selection and knowledge-enhanced comment content modules. Extensive experiments show the proposed CPCLDetector outperforms the SOTA on PCLMM and achieves higher performance on PCLMMPLUS. CPCL videos are detected more accurately, supporting content governance and protecting vulnerable groups. Code and dataset are available at https://github.com/jiaxunyang256/PCLD.
eess.AS
[501] Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li
Main category: eess.AS
TL;DR: This paper investigates hierarchical strategies for speech generation using LLMs, comparing autoregressive and MaskGIT-based transformers to handle multicodebook dependencies in acoustic codes, and provides guidelines for decoding strategy selection based on efficiency and quality tradeoffs.
Details
Motivation: Speech generation models based on LLMs face challenges with discrete acoustic codes that have multicodebook structure, requiring joint prediction of N codebook entries per timestep. Parallel prediction approaches assume independence among codebooks, which reduces fidelity, necessitating better methods to capture intra-timestep dependencies.Method: The paper systematically investigates two local transformer (LT) architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both enable frame stacking where the primary transformer predicts multiple frames jointly and the LT decodes their codebooks.
Result: The analysis characterizes tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Frame stacking offers improvements in speed without compromising perceptual quality.
Conclusion: The paper proposes practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity, providing insights into optimal approaches for different use cases.
Abstract: Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.
[502] Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning
Shaoshi Ling, Gang Liu, Guoli Ye, Jinyu Li
Main category: eess.AS
TL;DR: A novel multi-stage reinforcement learning framework to enhance speech summarization in multi-modal large language models (MLLMs), narrowing the performance gap with text-based LLMs.
Details
Motivation: Open-source MLLMs lag behind state-of-the-art text-based LLMs for speech summarization, limiting practical deployment despite the growing need for spoken content understanding.Method: Multi-stage reinforcement learning training framework that enables MLLMs to generate textual summaries directly from speech without intermediate transcriptions.
Result: Substantial improvements over strong baselines, outperforms larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.
Conclusion: The proposed framework effectively enhances speech summarization capabilities in MLLMs, making them more competitive with text-based approaches for practical applications.
Abstract: Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.
[503] Selective Classifier-free Guidance for Zero-shot Text-to-speech
John Zheng, Farhad Maleki
Main category: eess.AS
TL;DR: CFG strategies from image generation generally fail in speech synthesis, but selective CFG timing and text-representation dependent approaches can improve speaker similarity while maintaining text adherence.
Details
Motivation: Achieving balance between speaker fidelity and text adherence in zero-shot text-to-speech is challenging, and CFG strategies successful in image generation are underexplored for speech synthesis.Method: Evaluate adaptability of CFG strategies from image generation to speech synthesis, extend separated-condition CFG approaches, and test selective CFG timing (standard CFG early, selective CFG later).
Result: CFG strategies effective in image generation generally fail in speech synthesis. Selective CFG timing improves speaker similarity with limited text adherence degradation. Effectiveness is highly text-representation dependent (English vs Mandarin show different results).
Conclusion: CFG strategies require domain-specific adaptation for speech synthesis, with timing and text-representation considerations being crucial factors for success.
Abstract: In zero-shot text-to-speech, achieving a balance between fidelity to the target speaker and adherence to text content remains a challenge. While classifier-free guidance (CFG) strategies have shown promising results in image generation, their application to speech synthesis is underexplored. Separating the conditions used for CFG enables trade-offs between different desired characteristics in speech synthesis. In this paper, we evaluate the adaptability of CFG strategies originally developed for image generation to speech synthesis and extend separated-condition CFG approaches for this domain. Our results show that CFG strategies effective in image generation generally fail to improve speech synthesis. We also find that we can improve speaker similarity while limiting degradation of text adherence by applying standard CFG during early timesteps and switching to selective CFG only in later timesteps. Surprisingly, we observe that the effectiveness of a selective CFG strategy is highly text-representation dependent, as differences between English and Mandarin can lead to different results even with the same model.
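A minimal sketch of the timestep-gated recipe described above; variable names are ours, not the paper’s API. With the usual convention that sampling runs from high t to low t, early steps apply standard CFG against the fully unconditional prediction, while later steps guide against a prediction in which only the selected condition (for example, the speaker prompt) is dropped:

```python
import torch

def guided_eps(eps_cond, eps_full_uncond, eps_partial_uncond,
               t: int, t_switch: int, w: float = 3.0):
    """Timestep-gated classifier-free guidance: choose the 'negative'
    prediction by timestep, then apply the standard CFG combination."""
    negative = eps_full_uncond if t >= t_switch else eps_partial_uncond
    return negative + w * (eps_cond - negative)

e = guided_eps(torch.ones(4), torch.zeros(4), 0.5 * torch.ones(4),
               t=100, t_switch=500)  # late step: selective guidance
```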
[504] Short-Segment Speaker Verification with Pre-trained Models and Multi-Resolution Encoder
Jisoo Myoung, Sangwook Han, Kihyuk Kim, Jong Won Shin
Main category: eess.AS
TL;DR: Proposes a speaker verification system that combines pre-trained model features with filterbank features and multi-resolution time domain encoder features to improve performance on short-segment speaker verification.
Details
Motivation: Current pre-trained models for speaker verification have lower temporal resolution (20ms) than typical filterbank features, which is problematic for short-segment SV where input segments are shorter than 2 seconds. Existing multi-resolution approaches only consider lower resolution features.Method: Utilizes PTM features along with filterbank features and features from a multi-resolution time domain encoder with window shifts of 25, 50, 100, and 200 samples to capture information at multiple temporal resolutions.
Result: Experimental results on the VoxCeleb dataset with various input lengths showed consistent improvements over systems with various combinations of input features.
Conclusion: The proposed multi-resolution feature combination approach effectively improves speaker verification performance, particularly for short-segment scenarios, by capturing complementary information from different temporal resolutions.
Abstract: Speaker verification (SV) utilizing features obtained from models pre-trained via self-supervised learning has recently demonstrated impressive performances. However, these pre-trained models (PTMs) usually have a temporal resolution of 20 ms, which is lower than typical filterbank features. It may be problematic especially for short-segment SV with an input segment shorter than 2 s, in which we need to extract as much information as possible from the input with a limited length. Although there have been approaches to utilize multi-resolution features from the HuBERT models, the window shifts were 320, 640, and 1600 samples when the sampling rate was 16 kHz and thus only lower resolution features were considered. In this study, we propose an SV system which utilizes PTM features along with filterbank features and those from the multi-resolution time domain encoder with window shifts of 25, 50, 100, and 200 samples. Experimental results on the VoxCeleb dataset with various input lengths showed consistent improvements over systems with various combinations of input features.
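To make the resolution argument concrete: a hop of 320 samples at 16 kHz is 20 ms, whereas the proposed shifts of 25-200 samples correspond to roughly 1.6-12.5 ms. Below is a minimal sketch of such a multi-resolution front-end, with assumed channel counts and kernel sizes (the paper's exact encoder may differ):

```python
import torch
import torch.nn as nn

class MultiResEncoder(nn.Module):
    """Parallel 1-D conv branches whose strides realize window shifts of
    25/50/100/200 samples, i.e. finer than the 20 ms hop of typical PTMs."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(1, channels, kernel_size=2 * hop, stride=hop, padding=hop)
            for hop in (25, 50, 100, 200)
        ])

    def forward(self, wav: torch.Tensor):   # wav: (batch, 1, samples)
        return [torch.relu(branch(wav)) for branch in self.branches]

feats = MultiResEncoder()(torch.randn(2, 1, 16000))  # four feature maps, one per resolution
```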
[505] MMedFD: A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
Hongzhao Chen, XiaoYang Wang, Jing Lan, Hexiao Ding, Yufeng Jiang, MingHui Yang, DanHui Xu, Jun Luo, Nga-Chun Ng, Gerald W. Y. Cheng, Yunlin Mao, Jung Sun Yoo
Main category: eess.AS
TL;DR: MMedFD is the first real-world Chinese healthcare ASR corpus for multi-turn, full-duplex settings, featuring 5,805 annotated sessions from a deployed AI assistant with comprehensive timing and role labels.
Details
Motivation: There is a scarcity of open benchmarks for clinical dialogue ASR that can handle full-duplex interaction, speaker overlap, and low-latency constraints in real-world healthcare settings.Method: The authors introduce a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio for long-context recognition.
Result: The dataset includes comprehensive ASR evaluation metrics (WER, CER, HC-WER) and LLM-generated response assessment using rubric-based and pairwise protocols.
Conclusion: MMedFD establishes a reproducible framework for benchmarking streaming ASR and end-to-end duplex agents in healthcare deployment, with the dataset and resources publicly available.
Abstract: Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured from a deployed AI assistant, the dataset comprises 5,805 annotated sessions with synchronized user and mixed-channel views, RTTM/CTM timing, and role labels. We introduce a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio for long-context recognition. ASR evaluation includes WER, CER, and HC-WER, which measures concept-level accuracy across healthcare settings. LLM-generated responses are assessed using rubric-based and pairwise protocols. MMedFD establishes a reproducible framework for benchmarking streaming ASR and end-to-end duplex agents in healthcare deployment. The dataset and related resources are publicly available at https://github.com/Kinetics-JOJO/MMedFD
[506] SCORE: Scaling audio generation using Standardized COmposite REwards
Jaemin Jung, Jaehun Kim, Inkyu Shin, Joon Son Chung
Main category: eess.AS
TL;DR: This paper introduces Inference-Time Scaling with multi-reward guidance to enhance Text-to-Audio generation, improving both perceptual quality and textual alignment without additional training.
Details
Motivation: Existing Text-to-Audio models struggle to balance perceptual quality and textual alignment reliably. The authors aim to address this limitation by leveraging inference-time optimization techniques.Method: The method adopts Inference-Time Scaling (training-free) and proposes a novel multi-reward guidance system that normalizes different reward components into a common scale and combines them with weighted summation. It also introduces a new audio-text alignment metric using an audio language model.
Result: Empirical results show that the method significantly improves both semantic alignment and perceptual quality, outperforming naive generation and existing reward guidance techniques.
Conclusion: The proposed Inference-Time Scaling with multi-reward guidance effectively enhances Text-to-Audio generation performance, providing stable guidance and explicit control over desired audio aspects without requiring additional training.
Abstract: The goal of this paper is to enhance Text-to-Audio generation at inference, focusing on generating realistic audio that precisely aligns with text prompts. Despite the rapid advancements, existing models often fail to achieve a reliable balance between perceptual quality and textual alignment. To address this, we adopt Inference-Time Scaling, a training-free method that improves performance by increasing inference computation. We establish its unexplored application to audio generation and propose a novel multi-reward guidance that gives equal significance to each component essential to perception. By normalizing each reward value into a common scale and combining them with a weighted summation, the method not only enforces stable guidance but also enables explicit control over desired aspects. Moreover, we introduce a new audio-text alignment metric using an audio language model for more robust evaluation. Empirically, our method improves both semantic alignment and perceptual quality, significantly outperforming naive generation and existing reward guidance techniques. Synthesized samples are available on our demo page: https://mm.kaist.ac.kr/projects/score
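The normalize-then-combine step is easy to sketch. Below, the common scale is taken to be a z-score across the candidate set and the reward functions are placeholders; the paper's actual rewards and normalization may differ.

```python
import numpy as np

def composite_reward(candidates, reward_fns, weights):
    """Score candidates with several reward models, z-normalize each reward
    across candidates to a common scale, then combine by weighted sum."""
    scores = np.array([[fn(c) for c in candidates] for fn in reward_fns])   # (R, N)
    z = (scores - scores.mean(1, keepdims=True)) / (scores.std(1, keepdims=True) + 1e-8)
    return np.asarray(weights) @ z                                          # (N,)

# Best-of-N inference-time scaling (sketch): generate N candidates and keep
# the one with the highest composite score.
# best = candidates[int(np.argmax(composite_reward(candidates, fns, w)))]
```

Normalizing first keeps any single reward from dominating the sum, which is what makes the weighted combination a stable guidance signal.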
[507] Weakly Supervised Phonological Features for Pathological Speech Analysis
Jenthe Thienpondt, Geoffroy Vanderreydt, Abdessalem Hammami, Kris Demuynck
Main category: eess.AS
TL;DR: Proposes a weakly supervised method using ASR with phonological feature bottleneck for paralinguistic speech analysis, achieving competitive results for intelligibility prediction and pathology classification with interpretable features.
Details
Motivation: Lack of labeled frame-level datasets for paralinguistic speech properties makes automatic modeling difficult, especially for speech disorder analysis and treatment optimization.Method: Uses weakly supervised training with ASR model containing interpretable frame-level phonological feature bottleneck layer, exploiting known acoustic properties of phonemes.
Result: Phonological features perform similarly to state-of-the-art acoustic features (75% classification accuracy, 8.43 RMSE for intelligibility prediction) while being text-independent and interpretable.
Conclusion: The proposed phonological features provide useful interpretable insights for speech therapists while maintaining competitive performance on speech pathology analysis tasks.
Abstract: Paralinguistic properties of speech are essential in analyzing and choosing optimal treatment options for patients with speech disorders. However, automatic modeling of these characteristics is difficult due to the lack of labeled speech datasets describing paralinguistic properties, especially at the frame-level. In this paper, we propose a weakly supervised training method which exploits the known acoustic properties of phonemes by training an ASR model with an interpretable frame-level phonological feature bottleneck layer. Subsequently, we assess the viability of these phonological features in speech pathology analysis by developing corresponding models for intelligibility prediction and speech pathology classification. Models using our proposed phonological features perform similarly to those using other state-of-the-art acoustic features on both tasks, with a classification accuracy of 75% and an RMSE of 8.43 on speech intelligibility prediction. In contrast to others, our phonological features are text-independent and highly interpretable, providing potentially useful insights for speech therapists.
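The bottleneck idea can be pictured as an ASR network forced to route all information through a small sigmoid layer whose units correspond to phonological features (voicing, nasality, and so on). The sketch below is a toy rendering under that assumption, not the paper's architecture:

```python
import torch
import torch.nn as nn

class PhonoBottleneckASR(nn.Module):
    """Encoder -> interpretable frame-level phonological bottleneck -> phoneme
    logits; phoneme supervision trains the bottleneck only weakly."""
    def __init__(self, n_mel=80, n_phono=24, n_phonemes=40, d=256):
        super().__init__()
        self.encoder = nn.GRU(n_mel, d, num_layers=2, batch_first=True)
        self.to_phono = nn.Linear(d, n_phono)
        self.to_phoneme = nn.Linear(n_phono, n_phonemes)

    def forward(self, x):                        # x: (batch, frames, n_mel)
        h, _ = self.encoder(x)
        phono = torch.sigmoid(self.to_phono(h))  # frame-level feature activations
        return self.to_phoneme(phono), phono     # ASR logits + features for analysis
```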
[508] MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model
The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran, Joon Son Chung, Duc Dung Nguyen
Main category: eess.AS
TL;DR: MAGE is a Masked Audio Generative Enhancer that improves speech enhancement efficiency and quality through a novel coarse-to-fine masking strategy and lightweight corrector module, achieving state-of-the-art results with only 200M parameters.
Details
Motivation: Speech enhancement faces a trade-off between efficiency and perceptual quality. Current masked generative models use random masking, which is inefficient and lacks generalization. MAGE aims to overcome these limitations with a more intelligent masking approach.Method: MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens early and rare tokens later. It includes a lightweight corrector module to detect and refine low-confidence predictions. Built on BigCodec and finetuned from Qwen2.5-0.5B, it’s reduced to 200M parameters through selective layer retention.
Result: Experiments on DNS Challenge and noisy LibriSpeech show MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines.
Conclusion: MAGE demonstrates that intelligent masking strategies and lightweight correction modules can significantly improve speech enhancement efficiency and quality, achieving superior performance with fewer parameters compared to existing approaches.
Abstract: Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
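One toy way to realize a scarcity-aware schedule, assuming corpus token frequencies are available (the grouping below is an illustration, not MAGE's exact scheduler): order positions by the frequency of the token they hold and commit the frequent ones in early steps, the rare ones in later refinements.

```python
import numpy as np

def decode_order(token_freq, seq_token_ids, num_steps=10):
    """Group sequence positions per step: frequent-token positions first,
    rare-token positions in the later refinement steps."""
    freq = token_freq[seq_token_ids]           # corpus frequency per position
    order = np.argsort(-freq)                  # most frequent first
    return np.array_split(order, num_steps)    # one position group per step

# Toy: 5-token vocab with corpus counts, one 8-token target sequence.
groups = decode_order(np.array([100, 50, 10, 5, 1]),
                      np.array([0, 4, 1, 2, 0, 3, 1, 4]), num_steps=4)
```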
[509] Voice Privacy Preservation with Multiple Random Orthogonal Secret Keys: Attack Resistance Analysis
Kohei Tanaka, Hitoshi Kiya, Sayaka Shiota
Main category: eess.AS
TL;DR: Proposed method enhances speech privacy protection by using multiple random orthogonal matrices as secret keys, improving attack resistance and expanding applicability to more deep learning models.
Details
Motivation: Growing concerns about speech privacy in cloud-based deep learning models, with existing methods having limited attack resistance and model constraints.Method: Uses multiple random orthogonal matrices as secret keys instead of single matrix, with approaches to relax model constraints for broader applicability.
Result: Experimental results show maintained privacy protection performance for speaker concealment under more powerful attack scenarios.
Conclusion: The proposed method successfully enhances attack resistance and expands model applicability while preserving speech privacy protection performance.
Abstract: Recently, opportunities to transmit speech data to deep learning models executed in the cloud have increased. This has led to growing concerns about speech privacy, including both speaker-specific information and the linguistic content of utterances. As an approach to preserving speech privacy, a speech privacy-preserving method based on encryption using a secret key with a random orthogonal matrix has been proposed. This method enables cloud-based model inference while concealing both the speech content and the speaker identity. However, the method has limited attack resistance and is constrained in terms of the deep learning models to which the encryption can be applied. In this work, we propose a method that enhances the attack resistance of the conventional speech privacy-preserving technique by employing multiple random orthogonal matrices as secret keys. We also introduce approaches to relax the model constraints, enabling the application of our method to a broader range of deep learning models. Furthermore, we investigate the robustness of the proposed method against attacks using extended attack scenarios based on the scenarios employed in the Voice Privacy Challenge. Our experimental results confirmed that the proposed method maintains privacy protection performance for speaker concealment, even under more powerful attack scenarios not considered in prior work.
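The key material itself is simple to construct: a random orthogonal matrix can be drawn from the QR decomposition of a Gaussian matrix, and the multi-key scheme can be imitated by assigning different keys to different feature frames. The cycling assignment and dimensions below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def random_orthogonal(dim, rng):
    """Haar-distributed orthogonal matrix: QR of a Gaussian matrix with the
    columns of Q sign-corrected by the diagonal of R."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))

def encrypt(frames, keys):
    """Rotate each feature frame by one of several secret orthogonal keys
    (here cycled frame-by-frame, one possible multi-key assignment)."""
    return np.stack([frames[i] @ keys[i % len(keys)] for i in range(len(frames))])

rng = np.random.default_rng(0)
keys = [random_orthogonal(80, rng) for _ in range(4)]    # multiple secret keys
cipher = encrypt(rng.standard_normal((100, 80)), keys)   # 100 frames, 80-dim features
```

Orthogonal keys preserve inner products within each frame, which is what allows a suitably adapted model to run inference on the encrypted features.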
[510] Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration
Yifan Yang, Bing Han, Hui Wang, Long Zhou, Wei Wang, Mingyu Cui, Xu Tan, Xie Chen
Main category: eess.AS
TL;DR: ProsodyEval introduces a new dataset and DS-WED metric for better prosody diversity assessment in TTS systems, showing superior correlation with human perception compared to existing acoustic metrics.
Details
Motivation: Current acoustic metrics capture only partial prosodic variation and correlate poorly with human perception, leaving prosody diversity quantification underexplored in TTS systems.Method: Created ProsodyEval dataset with 1000 speech samples from 7 TTS systems and 2000 human ratings, then proposed DS-WED metric using weighted edit distance over semantic tokens from HuBERT and WavLM.
Result: DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics and remains robust across different speech tokenization methods. Benchmarking revealed factors influencing prosody diversity including generative modeling paradigms and duration control.
Conclusion: The proposed DS-WED metric effectively quantifies prosody diversity and current large audio language models remain limited in capturing prosodic variations, highlighting areas for future improvement in TTS systems.
Abstract: Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust across speech tokenizations from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at https://prosodyeval.github.io.
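Stripped to its core, a DS-WED-style score is a weighted edit distance over discrete token sequences; a system's prosody diversity can then be summarized as, for example, the average pairwise distance among its samples for the same text. The skeleton below uses scalar operation weights as placeholders for the paper's weighting scheme.

```python
def weighted_edit_distance(a, b, w_sub=1.0, w_ins=1.0, w_del=1.0):
    """Dynamic-programming edit distance over token sequences with separate
    substitution/insertion/deletion weights."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else w_sub)
            d[i][j] = min(sub, d[i - 1][j] + w_del, d[i][j - 1] + w_ins)
    return d[m][n]

print(weighted_edit_distance([3, 7, 7, 2], [3, 7, 2, 2]))  # 1.0 (one substitution)
```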
[511] Evaluating pretrained speech embedding systems for dysarthria detection across heterogenous datasets
Lovisa Wihlborg, Jemima Goodall, David Wheatley, Jacob J. Webber, Johnny Tam, Christine Weaver, Suvankar Pal, Siddharthan Chandran, Sohan Seth, Oliver Watts, Cassia Valentini-Botinhao
Main category: eess.AS
TL;DR: Evaluation of 17 pretrained speech embedding systems for dysarthric speech detection across 6 datasets, addressing data limitations through cross-validation and null hypothesis testing.
Details
Motivation: Dysarthric speech datasets are often small, imbalanced, and suffer from recording biases, requiring robust evaluation methods to ensure clinical validity.Method: Used cross-validation runs to estimate chance level, compared scores against null hypothesis distribution, evaluated within-dataset and cross-dataset performance across 6 datasets.
Result: Within-dataset results varied considerably by dataset, cross-dataset accuracy was lower than within-dataset, raising concerns about generalization and benchmarking practices.
Conclusion: Findings question the clinical validity of systems trained and tested on the same dataset, highlighting generalization challenges in dysarthric speech detection.
Abstract: We present a comprehensive evaluation of pretrained speech embedding systems for the detection of dysarthric speech using existing accessible data. Dysarthric speech datasets are often small and can suffer from recording biases as well as data imbalance. To address these, we selected a range of datasets covering related conditions and adopted several cross-validation runs to estimate the chance level. To certify that results are above chance, we compare the distribution of scores across these runs against the distribution of scores of a carefully crafted null hypothesis. In this manner, we evaluate 17 publicly available speech embedding systems across 6 different datasets, reporting the cross-validation performance on each. We also report cross-dataset results derived when training with one particular dataset and testing with another. We observed that within-dataset results vary considerably depending on the dataset, regardless of the embedding used, raising questions about which datasets should be used for benchmarking. We found that cross-dataset accuracy is, as expected, lower than within-dataset, highlighting challenges in the generalization of the systems. These findings have important implications for the clinical validity of systems trained and tested on the same dataset.
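The null-hypothesis machinery can be sketched as a label-permutation test: rerun cross-validation with shuffled labels to build a distribution of chance scores, then ask whether the real score clears it. The classifier and the 95th-percentile cutoff below are illustrative stand-ins for the study's actual choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def null_distribution(X, y, n_runs=100, seed=0):
    """Chance-level scores: cross-validation repeated with permuted labels,
    which breaks any embedding-label association."""
    rng = np.random.default_rng(seed)
    clf = LogisticRegression(max_iter=1000)
    return np.array([cross_val_score(clf, X, rng.permutation(y), cv=5).mean()
                     for _ in range(n_runs)])

# A result counts as above chance if it exceeds, say, the null's 95th percentile:
# threshold = np.percentile(null_distribution(X, y), 95)
```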
[512] Discrete Diffusion for Generative Modeling of Text-Aligned Speech Tokens
Pin-Jui Ku, He Huang, Jean-Marie Lemercier, Subham Sekhar Sahoo, Zhehuai Chen, Ante Jukić
Main category: eess.AS
TL;DR: A discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction that improves quality, ASR performance, and inference speed compared to auto-regressive models.
Details
Motivation: To enhance speech reconstruction quality and efficiency by replacing auto-regressive decoders with discrete diffusion models, while systematically improving tokenization methods.Method: Replaces auto-regressive speech decoder with discrete diffusion counterpart, compares vector quantization modules (FSQ vs RVQ), and analyzes sampler choices, inference steps, and length-scale estimation robustness.
Result: Achieves significantly better reconstruction quality, stronger ASR performance, 35% relative WER reduction, +0.14 UT-MOS improvement, and generates speech in just 10 denoising steps with single-step generation capability.
Conclusion: DDM framework provides superior speech reconstruction with faster inference, demonstrating the effectiveness of discrete diffusion models for speech tokenization tasks.
Abstract: This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronger ASR performance, and faster inference. We provide a comprehensive analysis of applying DDMs to speech reconstruction, examining sampler choices, inference steps, and robustness to length-scale estimation errors. Furthermore, we improve the original TASTE by systematically comparing vector quantization modules, showing that FSQ yields up to a 35% relative WER reduction and +0.14 UT-MOS improvement over RVQ for AR models, while also enhancing DDM performance. Our model generates speech in just 10 denoising steps and even supports single-step generation with only minor quality degradation.
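The 10-step decoding loop is easiest to picture as confidence-based unmasking (a MaskGIT-style scheme, assumed here for illustration; the paper's sampler analysis covers several choices): start from an all-mask sequence and commit a growing fraction of high-confidence predictions each step. With steps=1 this collapses to the single-step generation mentioned above.

```python
import torch

@torch.no_grad()
def ddm_sample(denoiser, length, mask_id, steps=10):
    """Start fully masked; each step fills the most confident still-masked
    positions and leaves the rest for later refinement steps."""
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    for s in range(steps):
        logits = denoiser(tokens)                          # (1, length, vocab)
        conf, pred = logits.softmax(-1).max(-1)            # per-position confidence
        conf = conf.masked_fill(tokens.ne(mask_id), -1.0)  # only fill masked slots
        k = int(length * (s + 1) / steps) - int(length * s / steps)
        top = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, top, pred.gather(1, top))
    return tokens
```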
[513] ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement
Zhong-Qiu Wang
Main category: eess.AS
TL;DR: The paper proposes ctPuLSE, a system that uses close-talk speech enhancement to generate pseudo-labels for training far-field speech enhancement models on real-recorded data, addressing the limitation of supervised models trained on simulated data.
Details
Motivation: Current neural speech enhancement models trained on simulated far-field noisy-reverberant speech have limited generalizability to real-recorded mixtures. The challenge is that clean speech for real mixtures is unavailable, making direct supervision difficult.Method: The approach involves: 1) training an enhancement model on simulated mixtures to enhance real-recorded close-talk mixtures, 2) using the estimated close-talk speech as pseudo-labels to supervise training of far-field speech enhancement models on paired real-recorded far-field mixtures.
Result: Evaluation on CHiME-4 dataset shows that ctPuLSE can generate high-quality pseudo-labels and produce far-field speech enhancement models with strong generalizability to real data.
Conclusion: The proposed ctPuLSE system effectively addresses the generalizability issue by leveraging close-talk enhancement to create supervision for training on real mixtures, resulting in models that perform well on real-world data.
Abstract: The current dominant approach for neural speech enhancement is via purely-supervised deep learning on simulated pairs of far-field noisy-reverberant speech (i.e., mixtures) and clean speech. The trained models, however, often exhibit limited generalizability to real-recorded mixtures. To deal with this, this paper investigates training enhancement models directly on real mixtures. However, a major difficulty with this approach is that, since the clean speech of real mixtures is unavailable, good supervision for real mixtures is lacking. In this context, assuming that a training set consisting of real-recorded pairs of close-talk and far-field mixtures is available, we propose to address this difficulty via close-talk speech enhancement, where an enhancement model is first trained on simulated mixtures to enhance real-recorded close-talk mixtures, and the estimated close-talk speech is then utilized as supervision (i.e., a pseudo-label) for training far-field speech enhancement models directly on the paired real-recorded far-field mixtures. We name the proposed system ctPuLSE. Evaluation results on the popular CHiME-4 dataset show that ctPuLSE can derive high-quality pseudo-labels and yield far-field speech enhancement models with strong generalizability to real data.
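The second stage of the recipe fits in a short training loop. The sketch below assumes paired close-talk/far-field tensors and an L1 objective, both placeholders rather than the paper's actual loss or data interface.

```python
import torch
import torch.nn.functional as F

def train_farfield(ct_enhancer, ff_model, loader, optimizer):
    """ctPuLSE-style stage 2 (sketch): a close-talk enhancer pre-trained on
    simulated data supplies pseudo-labels for real far-field mixtures."""
    ct_enhancer.eval()
    for close_talk_mix, farfield_mix in loader:         # paired real recordings
        with torch.no_grad():
            pseudo_label = ct_enhancer(close_talk_mix)  # estimated close-talk speech
        loss = F.l1_loss(ff_model(farfield_mix), pseudo_label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The appeal of the design is that supervision now comes from real acoustics: the pseudo-label inherits the close-talk channel's high SNR while the input is a genuine far-field recording.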
[514] A GEN AI Framework for Medical Note Generation
Hui Yi Leong, Yi Fan Gao, Shuai Ji, Bora Kalaycioglu, Utku Pamuksuz
Main category: eess.AS
TL;DR: MediNotes is an AI framework that automates SOAP note generation from medical conversations using LLMs, RAG, and ASR to reduce EHR documentation burden on physicians.
Details
Motivation: Address the increasing administrative burden of EHR documentation that reduces direct patient care time and contributes to physician burnout.Method: Integrates Large Language Models, Retrieval-Augmented Generation, and Automatic Speech Recognition with advanced techniques like QLoRA and PEFT for efficient fine-tuning in resource-constrained environments.
Result: Evaluations on ACI-BENCH dataset show significant improvements in accuracy, efficiency, and usability of automated medical documentation.
Conclusion: MediNotes offers a robust solution to reduce administrative burden on healthcare professionals while improving clinical workflow quality.
Abstract: The increasing administrative burden of medical documentation, particularly through Electronic Health Records (EHR), significantly reduces the time available for direct patient care and contributes to physician burnout. To address this issue, we propose MediNotes, an advanced generative AI framework designed to automate the creation of SOAP (Subjective, Objective, Assessment, Plan) notes from medical conversations. MediNotes integrates Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Automatic Speech Recognition (ASR) to capture and process both text and voice inputs in real time or from recorded audio, generating structured and contextually accurate medical notes. The framework also incorporates advanced techniques like Quantized Low-Rank Adaptation (QLoRA) and Parameter-Efficient Fine-Tuning (PEFT) for efficient model fine-tuning in resource-constrained environments. Additionally, MediNotes offers a query-based retrieval system, allowing healthcare providers and patients to access relevant medical information quickly and accurately. Evaluations using the ACI-BENCH dataset demonstrate that MediNotes significantly improves the accuracy, efficiency, and usability of automated medical documentation, offering a robust solution to reduce the administrative burden on healthcare professionals while improving the quality of clinical workflows.
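For the QLoRA/PEFT ingredient, a typical Hugging Face setup looks like the sketch below; the base checkpoint and adapter hyperparameters are placeholders, not the configuration MediNotes uses.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base LLM in 4-bit (the "Q" in QLoRA), then attach low-rank adapters.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder base model
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))
model.print_trainable_parameters()   # only the adapters train; the 4-bit base stays frozen
```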
[515] EAI-Avatar: Emotion-Aware Interactive Talking Head Generation
Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Main category: eess.AS
TL;DR: EAI-Avatar is an emotion-aware talking head generation framework for bidirectional conversational interactions that produces temporally consistent avatars with rich emotional variations using LLMs and a novel interactive talking tree structure.
Details
Motivation: Most existing talking head generation methods focus on one-way portrait animation and lack precise emotion-adaptive capabilities for bidirectional conversational interactions, limiting practical applicability.Method: Uses LLMs for dialogue generation, a Transformer-based head mask generator for temporally consistent motion features, and an interactive talking tree structure with reverse-level traversal to extract historical emotional cues for expression synthesis.
Result: Extensive experiments demonstrate superior performance and effectiveness in generating virtual avatars that seamlessly transition between speaking and listening states with rich emotional variations.
Conclusion: The proposed EAI-Avatar framework successfully addresses the limitations of existing methods by enabling emotion-aware bidirectional conversational interactions with temporally consistent motion and rich emotional expressions.
Abstract: Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character’s emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
[516] Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion
Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Juan Liu, Ming Li
Main category: eess.AS
TL;DR: This paper studies multimodal fusion strategies for Target Speaker Extraction under modality dropout conditions, showing that training with high dropout rates enhances robustness while voice embeddings remain consistently reliable.
Details
Motivation: Real-world applications of multimodal speech enhancement often suffer from intermittent modality dropout, which reduces system effectiveness. The research aims to understand how different fusion strategies perform under varying dropout conditions.Method: Built upon state-of-the-art audio-visual speech enhancement system with four speaker identity cues: lip embeddings, voice speaker embedding via cross-attention, static face embedding, and novel dynamic expression embedding. Systematically evaluated combinations under zero dropout and 80% modality dropout training regimes.
Result: Full multimodal ensemble achieves optimal performance under zero dropout but degrades significantly with test-time dropout. Training with 80% modality dropout dramatically enhances robustness, maintaining performance even with severe missing modalities. Voice embeddings show consistent robustness, while expression embeddings provide complementary information.
Conclusion: Training strategies must account for real-world imperfections rather than pure performance maximization. High dropout rate training enables practical reliability in multimodal speech enhancement systems, with voice embeddings being particularly robust and expression embeddings adding valuable complementary features.
Abstract: Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This paper presents a comprehensive study on the interactions and robustness of various multimodal fusion strategies under varying degrees of modality dropout. We build upon a state-of-the-art audio-visual speech enhancement system and integrate four distinct speaker identity cues: lip embeddings for synchronized contextual information, a voice speaker embedding extracted via cross-attention for acoustic consistency, a static face embedding for speaker identity, and a novel dynamic expression embedding for frame-wise emotional features. We systematically evaluate different combinations of these modalities under two key training regimes: zero dropout and 80% modality dropout. Extensive experiments demonstrate that while a full multimodal ensemble achieves optimal performance under ideal (zero dropout) conditions, its effectiveness diminishes significantly when test-time dropout occurs without prior exposure during training. Crucially, we show that training with a high (80%) modality dropout rate dramatically enhances model robustness, enabling the system to maintain superior performance even under severe test-time missing modalities. Our findings highlight that voice embeddings exhibit consistent robustness, while the proposed expression embedding provides valuable complementary information. This work underscores the importance of training strategies that account for real-world imperfection, moving beyond pure performance maximization to achieve practical reliability in multimodal speech enhancement systems.
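Training-time modality dropout of the kind varied here takes only a few lines; whether cues are dropped independently per modality, as below, is an assumption rather than the paper's stated scheme.

```python
import torch

def drop_modalities(embeddings, p=0.8, training=True):
    """Zero out each identity cue (lip/voice/face/expression) with probability
    p during training, so the fusion model learns to cope with missing cues."""
    if not training:
        return embeddings
    return {name: emb * (torch.rand(()) >= p).float()   # kept with prob. 1 - p
            for name, emb in embeddings.items()}

cues = {k: torch.randn(1, 256) for k in ("lip", "voice", "face", "expression")}
cues = drop_modalities(cues, p=0.8)   # most cues zeroed in a typical step
```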
[517] Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches
Eloi Moliner, Michal Švento, Alec Wright, Lauri Juvela, Pavel Rajmic, Vesa Välimäki
Main category: eess.AS
TL;DR: This paper introduces a diffusion-based method for unsupervised blind estimation of nonlinear audio effects, comparing it with adversarial approaches and showing diffusion models provide more stable results while adversarial methods excel at estimating pronounced distortion effects.
Details
Motivation: Accurately estimating nonlinear audio effects without paired input-output signals is challenging, requiring robust unsupervised approaches for blind system identification in music technology applications.Method: The study proposes a novel diffusion generative model approach for blind system identification, comparing it with adversarial methods under different parameterizations of effect operators and varying lengths of available recordings, tested on guitar distortion effects.
Result: Experiments show diffusion-based approach provides more stable results and is less sensitive to data availability, while adversarial approach is superior at estimating more pronounced distortion effects.
Conclusion: Diffusion models demonstrate strong potential for robust unsupervised blind estimation of audio effects, contributing to system identification in music technology with complementary strengths to adversarial approaches.
Abstract: Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show that the diffusion-based approach provides more stable results and is less sensitive to data availability, while the adversarial approach is superior at estimating more pronounced distortion effects. Our findings contribute to the robust unsupervised blind estimation of audio effects, demonstrating the potential of diffusion models for system identification in music technology.
[518] On-device Internet of Sounds Sonification with Wavetable Synthesis Techniques for Soil Moisture Monitoring in Water Scarcity Contexts
Stephen Roddy
Main category: eess.AS
TL;DR: This paper presents a device-level sonification approach using wavetable synthesis for monitoring soil moisture levels in IoT networks, addressing water scarcity through sonic data representation.
Details
Motivation: To explore sonification at the device level rather than application/service level for IoT networks, specifically for monitoring soil moisture in the context of global water scarcity.Method: The paper formalizes an on-device wavetable sonification approach using wavetable synthesis techniques to map sensor data to acoustic parameters, with a prototype implementation.
Result: A prototype implementation of device-level sonification for soil moisture monitoring using wavetable synthesis is presented and explored.
Conclusion: The approach demonstrates the viability of device-level sonification for IoT networks, particularly for environmental monitoring applications like soil moisture tracking.
Abstract: Sonification, the mapping of data to sound to communicate information about the original data source, is becoming a viable strategy for the sonic representation and communication of information derived from the complex flows of data exchanged across Internet of Sounds (IoS) networks. This paper presents an IoS sonification implementation for monitoring soil moisture levels within the broader context of the globally increasing water scarcity. While previous work has focused on sonifications operating on the applications and services level of the IoS network infrastructure, this paper explores device-level sonification using wavetable synthesis techniques to map sensor data to acoustic parameters. An approach to on-device wavetable sonification is formalized, and a prototype implementation is presented and explored before the approach is contextualised with regard to the soil moisture monitoring tasks.
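To give a flavor of on-device wavetable sonification, the sketch below crossfades between two wavetables according to a normalized soil-moisture reading, so drier soil sounds harsher; the tables and the mapping are invented for illustration and are not the paper's design.

```python
import numpy as np

SR, TABLE_N = 16000, 2048
phase_ramp = 2 * np.pi * np.arange(TABLE_N) / TABLE_N
tables = [np.sin(phase_ramp),            # smooth tone ("wet")
          np.sign(np.sin(phase_ramp))]   # harsh square ("dry")

def sonify(moisture, freq=220.0, dur=0.5):
    """Map moisture in [0, 1] to a crossfade between the two wavetables."""
    t = np.arange(int(SR * dur))
    idx = (t * freq * TABLE_N / SR).astype(int) % TABLE_N   # wavetable lookup
    mix = float(np.clip(moisture, 0.0, 1.0))
    return mix * tables[0][idx] + (1.0 - mix) * tables[1][idx]
```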
eess.IV
[519] Frequency-Aware Ensemble Learning for BraTS 2025 Pediatric Brain Tumor Segmentation
Yuxiao Yi, Qingyao Zhuang, Zhi-Qin John Xu
Main category: eess.IV
TL;DR: An ensemble method combining nnU-Net, Swin UNETR, and HFF-Net with specialized extensions for pediatric brain tumor segmentation, achieving strong Dice scores across multiple tumor regions.
Details
Motivation: Pediatric brain tumor segmentation is challenging due to rarity and heterogeneity of these malignancies, but remains critical for clinical diagnosis and treatment planning.Method: Ensemble approach integrating three models: nnU-Net with adjustable initialization scales, Swin UNETR with transfer learning from BraTS 2021, and HFF-Net with frequency domain decomposition. Final ensemble combines nnU-Net (γ=0.7), fine-tuned Swin UNETR, and HFF-Net.
Result: Achieved Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT) on the BraTS-PED 2025 challenge.
Conclusion: The proposed ensemble method with specialized extensions effectively addresses pediatric brain tumor segmentation challenges and demonstrates strong performance across multiple tumor regions.
Abstract: Pediatric brain tumor segmentation presents unique challenges due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning. We propose an ensemble approach integrating nnU-Net, Swin UNETR, and HFF-Net for the BraTS-PED 2025 challenge. Our method incorporates three key extensions: adjustable initialization scales for optimal nnU-Net complexity control, transfer learning from BraTS 2021 pre-trained models to enhance Swin UNETR’s generalization on pediatric dataset, and frequency domain decomposition for HFF-Net to separate low-frequency tissue contours from high-frequency texture details. Our final ensemble combines nnU-Net ($\gamma=0.7$), fine-tuned Swin UNETR, and HFF-Net, achieving Dice scores of 72.3% (ET), 95.6% (NET), 68.9% (CC), 89.5% (ED), 92.3% (TC), and 92.3% (WT), respectively.
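One plausible reading of the fusion step is a weighted average of the three segmenters' softmax maps followed by an argmax; the weights and tensor layout below are assumptions, since the abstract does not spell out the fusion rule.

```python
import torch

def ensemble_segmentation(prob_maps, weights):
    """Weighted average of per-model class-probability volumes, then argmax.
    prob_maps: list of (B, C, D, H, W) tensors, one per model."""
    fused = sum(w * p for w, p in zip(weights, prob_maps))
    return fused.argmax(dim=1)          # (B, D, H, W) label volume
```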
[520] BALANCE: Bitrate-Adaptive Limit-Aware Netcast Content Enhancement Utilizing QUBO and Quantum Annealing
Animesh Rajpurohit, Michael Kelley, Wei Wang, Krishna Murthy Kattiyan Ramamoorthy
Main category: eess.IV
TL;DR: BALANCE is a quantum framework that optimizes video streaming quality under data caps by intelligently pre-selecting video segments using visual complexity and VMAF metrics, outperforming traditional ABR methods.
Details
Motivation: Address the challenge of optimizing video streaming quality while adhering to user-defined data caps in an era of increasing data constraints.Method: Uses Quantum framework with Bitrate-Adaptive Limit-Aware Netcast Content Enhancement (BALANCE), pre-selects video segments based on visual complexity and data consumption, employs VMAF metric for QoE enhancement, and formulates bitrate allocation as QUBO problem comparing Slack variable vs Dynamic Penalization Approach.
Result: DPA consistently outperforms Slack Variable Method, delivering more valid and optimal solutions as data limits increase, with notable improvement in QoE under equivalent data constraints compared to traditional bitrate ladders.
Conclusion: The quantum approach significantly enhances streaming satisfaction for users with limited data plans by effectively enforcing data limits while optimizing video quality.
Abstract: In an era of increasing data cap constraints, optimizing video streaming quality while adhering to user-defined data caps remains a significant challenge. This paper introduces Bitrate-Adaptive Limit-Aware Netcast Content Enhancement (BALANCE), a novel Quantum framework aimed at addressing this issue. BALANCE intelligently pre-selects video segments based on visual complexity and anticipated data consumption, utilizing the Video Multimethod Assessment Fusion (VMAF) metric to enhance Quality of Experience (QoE). We compare our method against traditional bitrate ladders used in Adaptive Bitrate (ABR) streaming, demonstrating a notable improvement in QoE under equivalent data constraints. We compare the Slack variable approach with the Dynamic Penalization Approach (DPA) by framing the bitrate allocation problem through Quadratic Unconstrained Binary Optimization (QUBO) to effectively enforce data limits. Our results indicate that the DPA consistently outperforms the Slack Variable Method, delivering more valid and optimal solutions as data limits increase. This new quantum approach significantly enhances streaming satisfaction for users with limited data plans.
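The generic shape of such a QUBO can be made concrete for one-bitrate-per-segment selection under a data cap: quality enters as a negative diagonal term, the one-hot choice per segment as a quadratic penalty, and the cap as a second quadratic penalty whose weight a dynamic penalization scheme would adapt rather than fix. All coefficients below are illustrative.

```python
import numpy as np

def build_qubo(vmaf, data, cap, lam1=50.0, lam2=1e-4):
    """Toy QUBO over x in {0,1}^(S*B): minimize x^T Q x, where x[s*B+b]=1
    means segment s plays at ladder rung b."""
    S, B = vmaf.shape
    d = data.ravel()
    Q = np.zeros((S * B, S * B))
    for s in range(S):                           # quality + one-hot per segment
        for b in range(B):
            i = s * B + b
            Q[i, i] += -vmaf[s, b] - lam1        # from lam1 * (sum_b x - 1)^2
            for b2 in range(b + 1, B):
                Q[i, s * B + b2] += 2.0 * lam1
    for i in range(S * B):                       # lam2 * (d @ x - cap)^2 expanded
        Q[i, i] += lam2 * (d[i] ** 2 - 2.0 * cap * d[i])
        for j in range(i + 1, S * B):
            Q[i, j] += 2.0 * lam2 * d[i] * d[j]
    return Q

Q = build_qubo(np.random.rand(4, 3) * 100, np.random.rand(4, 3) * 50, cap=120.0)
```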
[521] Ensuring Reliable Participation in Subjective Video Quality Tests Across Platforms
Babak Naderi, Ross Cutler
Main category: eess.IV
TL;DR: The paper proposes objective and subjective detectors for remote-desktop (RD) users in crowdsourced video quality assessment (VQA) to address reliability issues from worker exploits.
Details
Motivation: Crowdsourcing offers efficient VQA evaluation but suffers from unreliable submissions due to workers exploiting video metadata and using remote-desktop connections, which bias results.Method: The authors develop and compare objective and subjective detectors for identifying RD users and evaluate two mainstream crowdsourcing platforms on their susceptibility and mitigation capabilities under realistic test conditions.
Result: The study reveals the extent of RD connection usage and metadata exploits in crowdsourced VQA, providing insights into platform vulnerabilities.
Conclusion: Effective detection methods for RD users are necessary to maintain the reliability and validity of crowdsourced video quality assessments.
Abstract: Subjective video quality assessment (VQA) is the gold standard for measuring end-user experience across communication, streaming, and UGC pipelines. Beyond high-validity lab studies, crowdsourcing offers accurate, reliable, faster, and cheaper evaluation, but suffers from unreliable submissions by workers who ignore instructions or game rewards. Recent tests reveal sophisticated exploits of video metadata and rising use of remote-desktop (RD) connections, both of which bias results. We propose objective and subjective detectors for RD users and compare two mainstream crowdsourcing platforms on their susceptibility and mitigation under realistic test conditions and task designs.
[522] Infrared Image Super-Resolution: Systematic Review, and Future Trends
Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Shinichiro Omachi
Main category: eess.IV
TL;DR: This paper provides a comprehensive survey of infrared image super-resolution, covering applications, hardware challenges, methodologies, datasets, evaluation metrics, and future directions.
Details
Motivation: Infrared image super-resolution is crucial for computer vision tasks but faces unique challenges compared to conventional image SR. The field is rapidly developing and needs systematic organization to guide future research.Method: The authors conduct a systematic literature review and analysis of infrared image super-resolution techniques, categorizing methodologies and identifying key challenges in hardware systems and image processing approaches.
Result: The survey presents a comprehensive taxonomy of IR image SR methods, discusses current limitations, and highlights promising research directions. It also provides an updated repository of relevant work.
Conclusion: Infrared image super-resolution remains an active research area with significant potential. The survey serves as a valuable resource for researchers and will be regularly updated to keep pace with advancements in the field.
Abstract: Image Super-Resolution (SR) is essential for a wide range of computer vision and image processing tasks. Investigating infrared (IR) image (or thermal images) super-resolution is a continuing concern within the development of deep learning. This survey aims to provide a comprehensive perspective of IR image super-resolution, including its applications, hardware imaging system dilemmas, and taxonomy of image processing methodologies. In addition, the datasets and evaluation metrics in IR image super-resolution tasks are also discussed. Furthermore, the deficiencies in current technologies and possible promising directions for the community to explore are highlighted. To cope with the rapid development in this field, we intend to regularly update the relevant excellent work at https://github.com/yongsongH/Infrared_Image_SR_Survey.
[523] Scan-Adaptive MRI Undersampling Using Neighbor-based Optimization (SUNO)
Siddhant Gautam, Angqi Li, Nicole Seiberlich, Jeffrey A. Fessler, Saiprasad Ravishankar
Main category: eess.IV
TL;DR: Proposes SUNO framework for scan-adaptive Cartesian undersampling patterns in accelerated MRI, using alternating optimization to learn patient-specific sampling patterns and reconstruction models.
Details
Motivation: Current population-adaptive sampling patterns are sub-optimal for individual scans as they fail to capture scan-specific details and depend on population composition. Need for scan-adaptive approaches.Method: Alternating algorithm with iterative coordinate descent (ICD) optimization for scan-adaptive k-space sampling patterns per training example. Uses nearest neighbor search at test time based on low-frequency k-space information.
Result: Applied to fastMRI multi-coil knee and brain datasets, showing improved performance over current undersampling patterns at 4× and 8× acceleration factors in both visual quality and quantitative metrics.
Conclusion: SUNO framework enables effective scan-adaptive sampling for accelerated MRI, outperforming existing approaches and providing better reconstruction quality.
Abstract: Accelerated MRI involves collecting partial $k$-space measurements to reduce acquisition time, patient discomfort, and motion artifacts, and typically uses regular undersampling patterns or human-designed schemes. Recent works have studied population-adaptive sampling patterns learned from a group of patients (or scans). However, such patterns can be sub-optimal for individual scans, as they may fail to capture scan or slice-specific details, and their effectiveness can depend on the size and composition of the population. To overcome this issue, we propose a framework for jointly learning scan-adaptive Cartesian undersampling patterns and a corresponding reconstruction model from a training set. We use an alternating algorithm for learning the sampling patterns and the reconstruction model where we use an iterative coordinate descent (ICD) based offline optimization of scan-adaptive $k$-space sampling patterns for each example in the training set. A nearest neighbor search is then used to select the scan-adaptive sampling pattern at test time from initially acquired low-frequency $k$-space information. We applied the proposed framework (dubbed SUNO) to the fastMRI multi-coil knee and brain datasets, demonstrating improved performance over the currently used undersampling patterns at both $4\times$ and $8\times$ acceleration factors in terms of both visual quality and quantitative metrics. The code for the proposed framework is available at https://github.com/sidgautam95/adaptive-sampling-mri-suno.
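The test-time step then reduces to a nearest-neighbor lookup over the initially acquired low-frequency lines; matching on k-space magnitudes with an L2 distance, as below, is an assumption about details the abstract leaves open.

```python
import numpy as np

def select_mask(test_lowfreq, train_lowfreqs, train_masks):
    """Reuse the scan-adaptive sampling mask (optimized offline with ICD)
    of the training scan whose low-frequency k-space is closest."""
    feats = np.abs(train_lowfreqs).reshape(len(train_lowfreqs), -1)
    query = np.abs(test_lowfreq).ravel()
    nn = np.argmin(np.linalg.norm(feats - query, axis=1))
    return train_masks[nn]      # binary Cartesian line-selection pattern
```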
[524] Imaging Biomarkers for Neurodegenerative Diseases from Detailed Segmentation of Medial Temporal Lobe Subregions on in vivo Brain MRI Using Upsampling Strategy Guided by High-resolution ex vivo MRI
Yue Li, Pulkit Khandelwal, Long Xie, Laura E. M. Wisse, Amanda E. Denning, Christopher A. Brown, Emily McGrew, Sydney A. Lim, Niyousha Sadeghpour, Sadhana Ravikumar, Ranjit Ittyerah, Eunice Chung, Daniel T. Ohm, Nidhi S. Mundada, María Mercedes Íñiguez de Onzoño Martín, María del Mar Arroyo Jiménez, Monica Muñoz, Maria del Pilar Marcos Rabal, David J. Irwin, Edward B. Lee, Ricardo Insausti, Sandhitsu R. Das, David A. Wolk, Paul A. Yushkevich
Main category: eess.IV
TL;DR: A multi-modality MTL segmentation algorithm that combines T1w and T2w MRI by bringing both to nearly isotropic voxel space, improving Alzheimer’s disease diagnosis and monitoring.
Details
Motivation: The medial temporal lobe (MTL) is affected early in Alzheimer's disease, and different MRI modalities have distinct advantages for MTL morphometry - T2w for hippocampal subfields and T1w for extra-hippocampal regions. Current methods operate in anisotropic spaces, limiting their effectiveness.Method: Proposed a multi-modality segmentation algorithm that bridges T1w and T2w MRI by upsampling both to nearly isotropic voxel space using a model guided by high-resolution ex vivo 9.4T MRI. Combined with non-local means upsampling to create training data, then trained a nnUNet model.
Result: Biomarkers extracted using the proposed model showed greater ability to discriminate between mild cognitive impairment and cognitively unimpaired individuals, and had greater longitudinal stability compared to conventional models operating in anisotropic spaces.
Conclusion: Biomarkers derived from T1w and T2w MRI upsampled to nearly isotropic resolution have significant potential for improving Alzheimer’s disease diagnosis and monitoring disease progression.
Abstract: The medial temporal lobe (MTL) is a region impacted extensively and non-uniformly in early stages of Alzheimer’s disease (AD). Regional MTL morphometric measures extracted from magnetic resonance imaging (MRI) are supportive features for the diagnosis of AD and related disorders (ADRD). Different MRI modalities have distinct advantages for MTL morphometry. Anisotropic T2-weighted (T2w) MRI is preferred for hippocampal subfields due to its higher contrast between hippocampal layers. Isotropic T1-weighted (T1w) MRI is beneficial for thickness calculation of extra-hippocampal subregions due to its stable image quality and isotropic resolution. We propose a multi-modality MTL segmentation algorithm that bridges the T1w and T2w modalities by bringing both to a nearly isotropic voxel space. Guided by high-resolution ex vivo 9.4T MRI, an upsampling model was designed for the ground truth segmentations. Combined with non-local means upsampling, this model was used to construct a nearly isotropic T1w and T2w MTL subregion segmentation training set, which was used to train an nnUNet model. Morphometric biomarkers extracted by this model were compared to those extracted using conventional models operating in anisotropic spaces on downstream tasks. Biomarkers extracted using the proposed model had greater ability to discriminate between individuals with mild cognitive impairment and cognitively unimpaired individuals, and had greater longitudinal stability. These findings suggest that the biomarkers derived from T1w and T2w MRI upsampled to nearly isotropic resolution have significant potential for improving disease diagnosis and monitoring disease progression in ADRD.
[525] HAZEMATCHING: Dehazing Light Microscopy Images with Guided Conditional Flow Matching
Anirban Ray, Ashesh, Florian Jug
Main category: eess.IV
TL;DR: HazeMatching is a novel iterative method for dehazing light microscopy images that balances fidelity and realism by adapting conditional flow matching framework with hazy observation guidance.
Details
Motivation: Existing computational dehazing methods either prioritize fidelity at the expense of realism or produce perceptually convincing results that lack quantitative accuracy. The goal is to find a balanced trade-off between fidelity and realism for microscopy images.Method: HazeMatching adapts the conditional flow matching framework by guiding the generative process with a hazy observation in the conditional velocity field. It does not require an explicit degradation operator.
Result: Evaluated on 5 datasets (synthetic and real data), HazeMatching achieves consistent balance between distortion and perceptual quality compared to 7 baselines, with well-calibrated predictions.
Conclusion: The method effectively balances fidelity and realism in microscopy image dehazing, works without explicit degradation operators, and will be publicly available with permissive licensing.
Abstract: Fluorescence microscopy is a major driver of scientific progress in the life sciences. Although high-end confocal microscopes are capable of filtering out-of-focus light, cheaper and more accessible microscopy modalities, such as widefield microscopy, can not, which consequently leads to hazy image data. Computational dehazing is trying to combine the best of both worlds, leading to cheap microscopy but crisp-looking images. The perception-distortion trade-off tells us that we can optimize either for data fidelity, e.g. low MSE or high PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID. Existing methods either prioritize fidelity at the expense of realism, or produce perceptually convincing results that lack quantitative accuracy. In this work, we propose HazeMatching, a novel iterative method for dehazing light microscopy images, which effectively balances these objectives. Our goal was to find a balanced trade-off between the fidelity of the dehazing results and the realism of individual predictions (samples). We achieve this by adapting the conditional flow matching framework by guiding the generative process with a hazy observation in the conditional velocity field. We evaluate HazeMatching on 5 datasets, covering both synthetic and real data, assessing both distortion and perceptual quality. Our method is compared against 7 baselines, achieving a consistent balance between fidelity and realism on average. Additionally, with calibration analysis, we show that HazeMatching produces well-calibrated predictions. Note that our method does not need an explicit degradation operator to exist, making it easily applicable on real microscopy data. All data used for training and evaluation and our code will be publicly available under a permissive license.
[526] An on-chip Pixel Processing Approach with 2.4μs latency for Asynchronous Read-out of SPAD-based dToF Flash LiDARs
Yiyang Liu, Rongxuan Zhang, Istvan Gyongy, Alistair Gorman, Sarrah M. Patanwala, Filip Taneski, Robert K. Henderson
Main category: eess.IV
TL;DR: A fully asynchronous peak detection approach for SPAD-based dToF flash LiDAR that enables pixel-wise event-driven depth acquisition without global synchronization, reducing latency and motion blur while increasing effective frame rate.
Details
Motivation: To overcome limitations of frame-based LiDAR systems by enabling asynchronous operation that reduces redundant background data and computational load, while achieving lower latency and better performance in dynamic scenarios.Method: Proposes an asynchronous peak detection approach where pixels independently report depth once sufficient signal-to-noise ratio is achieved. Validated through two hardware implementations: offline 256×128 SPAD array with PC processing and real-time FPGA prototype with 2.4μs latency. Also derived a semi-closed-form solution for detection probability of raw-peak finding LiDAR systems.
Result: Demonstrates robust depth estimation, reflectivity reconstruction, and dynamic event-based representation under static and dynamic conditions. The method reduces latency, mitigates motion blur, and increases effective frame rate compared to frame-based systems.
Conclusion: Establishes foundation for compact, low-latency, event-driven LiDAR architectures suitable for robotics, autonomous driving, and consumer applications. The approach remains tunable via simple hyperparameters and benefits both conventional and proposed systems.
Abstract: We propose a fully asynchronous peak detection approach for SPAD-based direct time-of-flight (dToF) flash LiDAR, enabling pixel-wise event-driven depth acquisition without global synchronization. By allowing pixels to independently report depth once a sufficient signal-to-noise ratio is achieved, the method reduces latency, mitigates motion blur, and increases effective frame rate compared to frame-based systems. The framework is validated under two hardware implementations: an offline 256$\times$128 SPAD array with PC-based processing and a real-time FPGA proof-of-concept prototype with 2.4 $\mu$s latency for on-chip integration. Experiments demonstrate robust depth estimation, reflectivity reconstruction, and dynamic event-based representation under both static and dynamic conditions. The results confirm that asynchronous operation reduces redundant background data and computational load, while remaining tunable via simple hyperparameters. These findings establish a foundation for compact, low-latency, event-driven LiDAR architectures suited to robotics, autonomous driving, and consumer applications. In addition, we have derived a semi-closed-form solution for the detection probability of raw-peak-finding-based LiDAR systems that could benefit both conventional frame-based and proposed asynchronous LiDAR systems.
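The per-pixel rule is easy to picture: each SPAD pixel integrates photon timestamps into a local histogram and emits a depth event as soon as its peak clears a significance threshold, with no global frame clock. The bin width, background estimate, and threshold form below are illustrative choices, not the paper's circuit.

```python
import numpy as np

C = 3.0e8          # speed of light, m/s
BIN_W = 100e-12    # 100 ps histogram bin (assumed)

def process_photon(hist, bin_idx, k_sigma=5.0):
    """Accumulate one photon; return a depth event once the histogram peak is
    k_sigma above a robust background estimate, else keep integrating."""
    hist[bin_idx] += 1
    background = np.median(hist)
    if hist.max() > background + k_sigma * np.sqrt(max(background, 1.0)):
        tof = hist.argmax() * BIN_W          # time of flight at the peak bin
        return 0.5 * C * tof                 # depth in meters
    return None
```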
[527] MOIS-SAM2: Exemplar-based Segment Anything Model 2 for multi-lesion interactive segmentation of neurofibromas in whole-body MRI
Georgii Kolokolnikov, Marie-Lena Schmalhofer, Sophie Goetz, Lennart Well, Said Farschtschi, Victor-Felix Mautner, Inka Ristow, Rene Werner
Main category: eess.IV
TL;DR: MOIS-SAM2 is a novel interactive segmentation model that extends SAM2 with exemplar-based semantic propagation for efficient segmentation of neurofibromas in whole-body MRI, achieving superior performance over baseline methods with strong generalization across domain shifts.
Details
Motivation: Existing interactive segmentation methods fail to combine high lesion-wise precision with scalability to hundreds of lesions in neurofibromatosis type 1 (NF1) patients, where whole-body MRI is used for tumor detection and surveillance.
Method: MOIS-SAM2 extends the transformer-based Segment Anything Model 2 (SAM2) with exemplar-based semantic propagation. It was trained and evaluated on 119 WB-MRI scans from 84 NF1 patients, with testing across different domain-shift scenarios, including MRI field strength variation, low tumor burden, and scanner vendor differences.
Result: MOIS-SAM2 achieved scan-wise DSC of 0.60 on in-domain test set, outperforming 3D nnU-Net (DSC: 0.54) and SAM2 (DSC: 0.35). Performance was maintained under domain shifts (DSC: 0.50-0.53) and improved in low tumor burden cases (DSC: 0.61). Model-to-expert agreement (DSC: 0.62-0.68) was comparable to inter-expert agreement (DSC: 0.57-0.69).
Conclusion: MOIS-SAM2 enables efficient and scalable interactive segmentation of neurofibromas in WB-MRI with minimal user input and strong generalization, supporting integration into clinical workflows for NF1 tumor surveillance.
Abstract: Background and Objectives: Neurofibromatosis type 1 is a genetic disorder characterized by the development of numerous neurofibromas (NFs) throughout the body. Whole-body MRI (WB-MRI) is the clinical standard for detection and longitudinal surveillance of NF tumor growth. Existing interactive segmentation methods fail to combine high lesion-wise precision with scalability to hundreds of lesions. This study proposes a novel interactive segmentation model tailored to this challenge. Methods: We introduce MOIS-SAM2, a multi-object interactive segmentation model that extends the state-of-the-art, transformer-based, promptable Segment Anything Model 2 (SAM2) with exemplar-based semantic propagation. MOIS-SAM2 was trained and evaluated on 119 WB-MRI scans from 84 NF1 patients acquired using T2-weighted fat-suppressed sequences. The dataset was split at the patient level into a training set and four test sets (one in-domain and three reflecting different domain shift scenarios, e.g., MRI field strength variation, low tumor burden, differences in clinical site and scanner vendor). Results: On the in-domain test set, MOIS-SAM2 achieved a scan-wise DSC of 0.60 against expert manual annotations, outperforming baseline 3D nnU-Net (DSC: 0.54) and SAM2 (DSC: 0.35). Performance of the proposed model was maintained under MRI field strength shift (DSC: 0.53) and scanner vendor variation (DSC: 0.50), and improved in low tumor burden cases (DSC: 0.61). Lesion detection F1 scores ranged from 0.62 to 0.78 across test sets. Preliminary inter-reader variability analysis showed model-to-expert agreement (DSC: 0.62-0.68), comparable to inter-expert agreement (DSC: 0.57-0.69). Conclusions: The proposed MOIS-SAM2 enables efficient and scalable interactive segmentation of NFs in WB-MRI with minimal user input and strong generalization, supporting integration into clinical workflows.
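For readers unfamiliar with the headline metric, the Dice similarity coefficient (DSC) quoted throughout compares the overlap of two binary masks; a minimal NumPy version might look like the sketch below (the function name and smoothing constant are illustrative choices, not from the paper).

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary segmentation masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()   # voxels where both masks agree
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```

Scan-wise DSC as reported above would correspond to this value computed on the full whole-body volume of each scan, then summarized across the test set.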