Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 75]
- cs.CV [Total: 98]
- cs.AI [Total: 30]
- cs.SD [Total: 13]
- cs.LG [Total: 92]
- cs.MA [Total: 2]
- cs.MM [Total: 1]
- eess.AS [Total: 12]
- eess.IV [Total: 13]
cs.CL
[1] Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs
Mingyang Li, Viktor Schlegel, Tingting Mu, Warren Del-Pinto, Goran Nenadic
Main category: cs.CL
TL;DR: This paper presents a method for automated ICD coding using knowledge graphs to represent clinical documents, achieving improved performance and efficiency compared to text-only approaches.
Details
Motivation: Manual coding of clinical documents to standardized vocabularies like ICD is difficult and time-consuming. Automated coding can improve availability and accuracy of structured clinical data, but current methods underutilize external resources for input document representation.
Method: The authors compute structured representations of input documents using document-level knowledge graphs that provide a comprehensive structured view of patient conditions. This KG representation retains 90% of information with only 23% of the original text. They integrate it into the PLM-ICD architecture for automated ICD-9 coding.
Result: Experiments show improved Macro-F1 scores by up to 3.20% on popular benchmarks while improving training efficiency. The KG approach also offers better explainability compared to text-only baselines.
Conclusion: Using knowledge graphs to represent clinical documents significantly enhances automated ICD coding performance, efficiency, and explainability by effectively capturing different types of entities and relationships in patient data.
Abstract: Mapping clinical documents to standardised clinical vocabularies is an important task, as it provides structured data for information retrieval and analysis, which is essential to clinical research, hospital administration and improving patient care. However, manual coding is both difficult and time-consuming, making it impractical at scale. Automated coding can potentially alleviate this burden, improving the availability and accuracy of structured clinical data. The task is difficult to automate, as it requires mapping to high-dimensional and long-tailed target spaces, such as the International Classification of Diseases (ICD). While external knowledge sources have been readily utilised to enhance output code representation, the use of external resources for representing the input documents has been underexplored. In this work, we compute a structured representation of the input documents, making use of document-level knowledge graphs (KGs) that provide a comprehensive structured view of a patient’s condition. The resulting knowledge graph efficiently represents the patient-centred input documents with 23% of the original text while retaining 90% of the information. We assess the effectiveness of this graph for automated ICD-9 coding by integrating it into the state-of-the-art ICD coding architecture PLM-ICD. Our experiments yield improved Macro-F1 scores by up to 3.20% on popular benchmarks, while improving training efficiency. We attribute this improvement to different types of entities and relationships in the KG, and demonstrate the improved explainability potential of the approach over the text-only baseline.
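To make the pipeline concrete, here is a hypothetical minimal sketch of the core idea: linearize a patient-level KG into compact text and classify it with a pretrained clinical encoder. The triple format, encoder choice, and toy label space are our assumptions; the actual PLM-ICD architecture additionally uses label-wise attention, which is reduced here to a single linear head.

```python
# Hypothetical sketch: serialize a patient-level knowledge graph into compact
# text and feed it to a PLM-based multi-label ICD classifier. Triples, model,
# and label space are illustrative assumptions, not the paper's exact pipeline.
import torch
from transformers import AutoTokenizer, AutoModel

# A document-level KG as (head, relation, tail) triples extracted from a note.
kg_triples = [
    ("patient", "has_condition", "type 2 diabetes"),
    ("type 2 diabetes", "treated_with", "metformin"),
    ("patient", "has_symptom", "polyuria"),
]

# Linearize the graph; this compact view stands in for the full note text.
kg_text = " [SEP] ".join(f"{h} {r} {t}" for h, r, t in kg_triples)

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = tokenizer(kg_text, return_tensors="pt", truncation=True)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding

num_icd_codes = 50  # toy label space; real ICD-9 has thousands of codes
classifier = torch.nn.Linear(hidden.size(-1), num_icd_codes)
code_probs = torch.sigmoid(classifier(hidden))  # multi-label probabilities
```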
[2] Cross-Layer Attention Probing for Fine-Grained Hallucination Detection
Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga
Main category: cs.CL
TL;DR: CLAP is a novel activation probing technique that processes LLM activations across the entire residual stream to detect hallucinations, outperforming baselines and enabling fine-grained detection across different sampled responses.
Details
Motivation: Address growing reliability concerns with LLMs due to their tendency to generate inaccurate text (hallucinations) in various applications.
Method: Cross-Layer Attention Probing (CLAP) processes LLM activations across the entire residual stream as a joint sequence for hallucination detection.
Result: CLAP improves hallucination detection compared to baselines across five LLMs and three tasks, works on both greedy decoded and higher-temperature sampled responses, enables fine-grained detection, and maintains high reliability even out-of-distribution.
Conclusion: CLAP enables a detect-then-mitigate strategy that reduces hallucinations and improves LLM reliability compared to direct mitigation approaches.
Abstract: With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.
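The following is a minimal sketch of what a cross-layer probe in the spirit of CLAP could look like: per-layer residual-stream activations are stacked into a joint sequence and attended over before classification. The probe architecture and dimensions are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a cross-layer probe: stack the residual-stream activation
# of one token position from every layer into a sequence, attend over it, and
# classify hallucination vs. not. Architecture details are assumptions.
import torch
import torch.nn as nn

class CrossLayerProbe(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cls = nn.Linear(d_model, 1)

    def forward(self, layer_acts: torch.Tensor) -> torch.Tensor:
        # layer_acts: (batch, n_layers, d_model), one vector per layer,
        # treated as a joint sequence across the residual stream.
        attended, _ = self.attn(layer_acts, layer_acts, layer_acts)
        pooled = attended.mean(dim=1)           # pool over the layer axis
        return torch.sigmoid(self.cls(pooled))  # P(hallucination)

probe = CrossLayerProbe(d_model=4096)
acts = torch.randn(8, 32, 4096)  # e.g., 32 layers of an 8B-parameter LLM
p_halluc = probe(acts)           # shape (8, 1)
```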
[3] Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task
JungHo Jung, Junhyun Lee
Main category: cs.CL
TL;DR: This paper explores multi-task learning for speech-to-text translation by investigating three regularization sources: cross-modal consistency, same-modal R-drop, and the MT loss coefficient, introducing a “regularization horizon” within which hyperparameter tuning achieves near state-of-the-art performance.
Details
Motivation: End-to-end speech-to-text translation suffers from scarce paired speech-text data. The authors aim to overcome this limitation by leveraging machine translation bitext data through multi-task learning with effective regularization techniques.
Method: The paper formulates multi-task learning from a regularization perspective, investigating three regularization sources: 1) consistency regularization across different modalities, 2) R-drop regularization within the same modality, and 3) the MT loss coefficient as an additional regularization source. The authors introduce the concept of a “regularization horizon” to optimize these hyperparameters.
Result: Experiments on the MuST-C dataset show that tuning hyperparameters within the proposed regularization horizon achieves near state-of-the-art performance in speech-to-text translation.
Conclusion: The study demonstrates that effectively combining multiple regularization sources (cross-modal consistency, same-modal R-drop, and MT loss coefficient) through the proposed regularization horizon framework significantly improves speech-to-text translation performance when paired data is scarce.
Abstract: End-to-end speech-to-text translation typically suffers from the scarcity of paired speech-text data. One way to overcome this shortcoming is to utilize the bitext data from the Machine Translation (MT) task and perform Multi-Task Learning (MTL). In this paper, we formulate MTL from a regularization perspective and explore how sequences can be regularized within and across modalities. By thoroughly investigating the effect of consistency regularization (different modality) and R-drop (same modality), we show how they respectively contribute to the total regularization. We also demonstrate that the coefficient of MT loss serves as another source of regularization in the MTL setting. With these three sources of regularization, we introduce the optimal regularization contour in the high-dimensional space, called the regularization horizon. Experiments show that tuning the hyperparameters within the regularization horizon achieves near state-of-the-art performance on the MuST-C dataset.
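A hedged sketch of how the three regularization sources could combine into one training objective; the coefficient names and the KL-based consistency terms are our assumptions, and the paper's contribution is tuning such coefficients along the regularization horizon.

```python
# Hedged sketch of a multi-task ST objective with the three regularization
# sources discussed above. Coefficients and loss forms are illustrative.
import torch
import torch.nn.functional as F

def kl_sym(p_logits, q_logits):
    """Symmetric KL between two output distributions (as used by R-drop)."""
    p, q = F.log_softmax(p_logits, -1), F.log_softmax(q_logits, -1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                  + F.kl_div(q, p.exp(), reduction="batchmean"))

def mtl_loss(st_logits, st_logits_2nd, text_logits, targets,
             mt_coef=1.0, consist_coef=0.1, rdrop_coef=0.1):
    # st_logits / st_logits_2nd: two dropout-perturbed passes on speech input
    # text_logits: predictions for the same target given the source text (MT)
    loss_st = F.cross_entropy(st_logits.transpose(1, 2), targets)
    loss_mt = F.cross_entropy(text_logits.transpose(1, 2), targets)
    loss_consist = kl_sym(st_logits, text_logits)  # (1) cross-modal consistency
    loss_rdrop = kl_sym(st_logits, st_logits_2nd)  # (2) same-modal R-drop
    # (3) mt_coef itself acts as a further source of regularization
    return loss_st + mt_coef * loss_mt \
        + consist_coef * loss_consist + rdrop_coef * loss_rdrop
```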
[4] Creativity Benchmark: A benchmark for marketing creativity for LLM models
Ninad Bhat, Kieran Browne, Pip Bingemann
Main category: cs.CL
TL;DR: Creativity Benchmark evaluates LLMs for marketing creativity across 100 brands and 3 prompt types, showing tightly clustered model performance with no clear winner and highlighting limitations of automated evaluation.
Details
Motivation: To develop a comprehensive evaluation framework for assessing large language models' creative capabilities in marketing contexts, addressing the need for brand-specific and expert-driven creativity assessment.
Method: Used 100 brands across 12 categories with three prompt types (Insights, Ideas, Wild Ideas). Collected 11,012 pairwise preferences from 678 professional creatives, analyzed with Bradley-Terry models. Also measured model diversity using cosine distances and sensitivity to prompt reframing.
Result: Models showed tightly clustered performance (Δθ ≈ 0.45) with no dominant model across brands or prompt types; the top-rated model beats the lowest-rated one only 61% of the time. LLM-as-judge setups showed weak correlations with human rankings and judge-specific biases. Conventional creativity tests only partially transfer to brand-constrained tasks.
Conclusion: Expert human evaluation is essential for assessing marketing creativity in LLMs, automated judges cannot substitute humans, and diversity-aware workflows are needed for effective creative applications.
Abstract: We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about $61\%$ of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
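The reported numbers are easy to verify under the Bradley-Terry model, where the head-to-head win probability is the logistic function of the ability gap:

```python
# Check of the abstract's arithmetic: a top-to-bottom ability spread of
# dtheta = 0.45 on the logit scale implies a win probability of about 0.61.
import math

def bt_win_prob(theta_a: float, theta_b: float) -> float:
    """P(A beats B) under Bradley-Terry with a logistic link."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))

print(bt_win_prob(0.45, 0.0))  # ~0.611, matching the reported 61%
```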
[5] CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor
Zhenhua Xu, Xixiang Zhao, Xubin Yue, Shengwei Tian, Changting Lin, Meng Han
Main category: cs.CL
TL;DR: CTCC is a novel rule-driven fingerprinting framework that embeds ownership traces in LLMs using contextual correlations across multiple dialogue turns, achieving better stealth and robustness than existing methods.
Details
Motivation: Address concerns around intellectual property protection for LLMs as model theft and unauthorized redistribution become increasingly feasible, overcoming limitations of existing fingerprinting methods that trade off stealthiness, robustness, and generalizability.
Method: A rule-driven framework that encodes contextual correlations across multiple dialogue turns (e.g., counterfactual correlations) rather than relying on token-level or single-turn triggers, enabling black-box verification while mitigating false positives and fingerprint leakage.
Result: Extensive experiments across multiple LLM architectures demonstrate CTCC consistently achieves stronger stealth and robustness than prior work, supporting continuous construction even if partial triggers are exposed.
Conclusion: CTCC provides a reliable and practical solution for ownership verification in real-world LLM deployment scenarios, with publicly available code and data.
Abstract: The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthness, robustness, and generalizability, being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns, such as counterfactual, rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. Our code and data are publicly available at https://github.com/Xuzhenhua55/CTCC.
[6] MultimodalHugs: Enabling Sign Language Processing in Hugging Face
Gerard Sant, Zifan Jiang, Carlos Escolano, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling
Main category: cs.CL
TL;DR: MultimodalHugs is a framework built on Hugging Face to address reproducibility and flexibility issues in sign language processing by supporting diverse data modalities like pose estimation and pixel data.
Details
Motivation: Sign language processing research faces challenges with complex ad-hoc code, low reproducibility, and unfair comparisons. Existing tools like Hugging Face lack flexibility for sign language experiments, as confirmed by a survey of SLP researchers.
Method: Developed the MultimodalHugs framework on top of Hugging Face, adding an abstraction layer to support diverse data modalities and tasks beyond standard templates, with a focus on sign languages but applicable to other multimodal use cases.
Result: The framework successfully accommodates diverse modalities including pose estimation data for sign languages and pixel data for text characters, as demonstrated through quantitative experiments.
Conclusion: MultimodalHugs provides a solution to reproducibility and flexibility challenges in SLP research by extending Hugging Face’s capabilities while maintaining its ecosystem advantages, making it suitable for both sign language and other multimodal applications.
Abstract: In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers. To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters.
[7] Temporal Preferences in Language Models for Long-Horizon Assistance
Ali Mazyaki, Mohammad Naghizadeh, Samaneh Ranjkhah Zonouzaghi, Hossein Setareh
Main category: cs.CL
TL;DR: Language models show future vs present time preferences that can be systematically manipulated through prompting, with reasoning-focused models exhibiting stronger future orientation under appropriate prompts.
Details
Motivation: To understand whether language models exhibit intertemporal preferences (future vs present orientation) similar to humans, and whether these preferences can be systematically manipulated through different prompting strategies.
Method: Adapted human experimental protocols to evaluate multiple LMs on time-tradeoff tasks, benchmarked against human decision makers. Introduced the Manipulability of Time Orientation (MTO) metric to measure preference changes between future- and present-oriented prompts.
Result: Reasoning-focused models (DeepSeek-Reasoner, grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities/geographies. Models that reason correctly about time orientation internalize future orientation for themselves as AI decision makers.
Conclusion: The findings have design implications for AI assistants that need to align with long-horizon goals, and outline a research agenda for personalized contextual calibration and socially aware deployment of language models.
Abstract: We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM’s revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment.
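A minimal sketch of how the MTO metric could be operationalized, assuming revealed time preference is summarized as the fraction of trials in which the model picks the delayed option; the paper's exact formulation may differ:

```python
# Hedged sketch of MTO: the shift in a model's revealed time preference
# between future- and present-oriented prompts. The "fraction choosing the
# later option" summary is our assumption.
def later_option_rate(choices: list[str]) -> float:
    """Fraction of time-tradeoff trials where the LM chose the later option."""
    return sum(c == "later" for c in choices) / len(choices)

def mto(future_prompt_choices: list[str],
        present_prompt_choices: list[str]) -> float:
    return later_option_rate(future_prompt_choices) - \
        later_option_rate(present_prompt_choices)

# Toy example: the framing shifts revealed preference by 0.4.
print(mto(["later"] * 8 + ["sooner"] * 2, ["later"] * 4 + ["sooner"] * 6))
```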
[8] The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks
Claudio Pinhanez, Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Yago Primerano
Main category: cs.CL
TL;DR: Study examines answer consistency of small LLMs (2B-8B parameters) when answering the same question multiple times, finding 50%-80% consistency at low temperatures and correlation between consistent answer accuracy and overall accuracy.
Details
Motivation: To understand how consistently small language models answer the same question across multiple trials and explore the trade-offs between consistency and accuracy.
Method: Analyzed open-source LLMs responding to 10 repetitions of questions from the MMLU-Redux and MedQA benchmarks, varying inference temperatures, model sizes (small vs medium), and comparing finetuned vs base models.
Result: Small models show 50%-80% question consistency at low temperatures, with consistent answer accuracy correlating with overall accuracy. Medium-sized models demonstrate much higher consistency levels.
Conclusion: Small LLMs answer only 50%-80% of questions consistently at low temperatures; accuracy among consistent answers correlates reasonably with overall accuracy, and medium-sized models show much higher consistency.
Abstract: This work explores the consistency of small LLMs (2B-8B parameters) in answering multiple times the same question. We present a study on known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium models (50B-80B), finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both of them. To support those studies, we propose some new analytical and graphical tools. Results show that the number of questions which can be answered consistently vary considerably among models but are typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers seems to reasonably correlate with overall accuracy. Results for medium-sized models seem to indicate much higher levels of answer consistency.
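The consistency measure implied above can be sketched in a few lines; treating a question as consistent only when all repetitions agree is our assumption, since other thresholds (e.g., majority agreement) are possible:

```python
# Minimal sketch: a question counts as answered consistently if all 10
# repetitions return the same choice (strict-agreement convention assumed).
def is_consistent(answers: list[str]) -> bool:
    return len(set(answers)) == 1

def consistency_rate(per_question_answers: list[list[str]]) -> float:
    return sum(map(is_consistent, per_question_answers)) / len(per_question_answers)

# Toy run: 2 of 3 questions answered identically across 10 repetitions.
trials = [["B"] * 10, ["C"] * 10, ["A"] * 6 + ["D"] * 4]
print(consistency_rate(trials))  # ~0.667
```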
[9] Beyond I’m Sorry, I Can’t: Dissecting Large Language Model Refusal
Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee
Main category: cs.CL
TL;DR: Researchers developed a pipeline to identify jailbreak-critical features in LLMs by analyzing sparse autoencoder activations, revealing causal mechanisms behind refusal behavior and feature redundancy.
Details
Motivation: To understand the internal causes of refusal behavior in instruction-tuned LLMs on harmful prompts, as this safety mechanism remains poorly understood despite its importance.
Method: Used sparse autoencoders on residual-stream activations of the Gemma-2-2B-IT and LLaMA-3.1-8B-IT models. Developed a three-stage pipeline: finding a refusal-mediating direction, greedy filtering to a minimal feature set, and fitting factorization machines to capture nonlinear interactions.
Result: Identified a broad set of jailbreak-critical features that causally influence refusal behavior. Found evidence of redundant features that remain dormant unless earlier features are suppressed, demonstrating feature redundancy in safety mechanisms.
Conclusion: The approach enables fine-grained auditing and targeted intervention in safety behaviors by manipulating interpretable latent spaces, providing mechanistic insights into refusal behavior in LLMs.
Abstract: Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
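A toy sketch of the causal test described above: ablating a candidate feature set by zeroing its SAE activations and decoding back into the residual stream. The encoder/decoder form is the standard SAE parameterization; the paper's three-stage search is not reproduced here.

```python
# Toy sketch: suppress refusal-linked SAE features by zeroing their
# activations and decoding the edited residual-stream vector.
import torch

def ablate_sae_features(resid: torch.Tensor, W_enc, b_enc, W_dec, b_dec,
                        feature_ids: list[int]) -> torch.Tensor:
    # Standard SAE: f = relu(x @ W_enc + b_enc), x_hat = f @ W_dec + b_dec
    f = torch.relu(resid @ W_enc + b_enc)
    f[..., feature_ids] = 0.0            # suppress the candidate feature set
    return f @ W_dec + b_dec             # edited residual-stream activation

d_model, d_sae = 2304, 16384             # e.g., Gemma-2-2B-sized dimensions
W_enc, b_enc = torch.randn(d_model, d_sae), torch.zeros(d_sae)
W_dec, b_dec = torch.randn(d_sae, d_model), torch.zeros(d_model)
resid = torch.randn(1, 8, d_model)       # (batch, seq, d_model)
edited = ablate_sae_features(resid, W_enc, b_enc, W_dec, b_dec, [12, 873])
```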
[10] Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement
Jing Ren, Weiqi Wang
Main category: cs.CL
TL;DR: Proposes quantitative metrics for evaluating LLM-generated academic content quality and reference validity, plus iterative prompting to improve writing and reduce fabricated references.
Details
Motivation: Address ethical concerns about LLMs generating incorrect or fabricated references in academic writing and the lack of objective evaluation methods for content quality.
Method: Developed two evaluation metrics (content quality and reference validity) and an iterative prompting method based on these scores to enhance LLM writing performance.
Result: The metrics provide an objective quantitative framework for assessing ChatGPT’s writing, and iterative prompting significantly improves content quality while reducing reference inaccuracies and fabrications.
Conclusion: The proposed approach effectively addresses ethical challenges in academic LLM usage by providing objective evaluation and improving reference validity through iterative prompting.
Abstract: Large language models (LLMs) like ChatGPT are increasingly used in academic writing, yet issues such as incorrect or fabricated references raise ethical concerns. Moreover, current content quality evaluations often rely on subjective human judgment, which is labor-intensive and lacks objectivity, potentially compromising the consistency and reliability. In this study, to provide a quantitative evaluation and enhance research proposal writing capabilities of LLMs, we propose two key evaluation metrics–content quality and reference validity–and an iterative prompting method based on the scores derived from these two metrics. Our extensive experiments show that the proposed metrics provide an objective, quantitative framework for assessing ChatGPT’s writing performance. Additionally, iterative prompting significantly enhances content quality while reducing reference inaccuracies and fabrications, addressing critical ethical challenges in academic contexts.
[11] Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data
Sepehr Golrokh Amin, Devin Rhoads, Fatemeh Fakhrmoosavi, Nicholas E. Lownes, John N. Ivan
Main category: cs.CL
TL;DR: LLM-based method generates synthetic travel diaries from open-source data, achieving comparable realism to classical methods with better purpose prediction and consistency.
Details
Motivation: Traditional travel diary generation relies on proprietary surveys; this study aims to create synthetic diaries using open-source data and LLMs for more accessible transportation modeling.
Method: Generates personas from ACS and SLD data, synthesizes diaries through LLM prompting, and validates with a novel realism score built from four metrics (Trip Count, Interval, Purpose, Mode), comparing generated diaries against real diaries using Jensen-Shannon Divergence.
Result: LLM-generated diaries achieve comparable overall realism (0.485 vs 0.455) to classical methods, excel in trip purpose determination, show greater consistency, and demonstrate statistical representativeness in aggregate validation.
Conclusion: LLMs are viable for zero-shot travel diary generation, establishing a quantifiable metric for synthetic diary evaluation and offering an alternative to traditional proprietary survey-based approaches.
Abstract: This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM’s statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM’s zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems.
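One component of the realism score can be sketched as follows, assuming each metric maps a Jensen-Shannon divergence between generated and real distributions to a 0-1 similarity; the paper's exact aggregation may differ:

```python
# Hedged sketch of a trip-count sub-score: compare the distribution of trips
# per person in generated vs. real diaries via Jensen-Shannon divergence.
import numpy as np
from scipy.spatial.distance import jensenshannon

def distribution(counts, support):
    hist = np.array([counts.count(k) for k in support], dtype=float)
    return hist / hist.sum()

support = range(0, 8)                       # trips per person per day (toy)
real = distribution([2, 3, 2, 4, 2, 3, 3], support)
generated = distribution([2, 2, 3, 4, 3, 3, 5], support)

# jensenshannon returns the JS *distance* (sqrt of the divergence, base 2 here)
jsd = jensenshannon(real, generated, base=2) ** 2
trip_count_score = 1.0 - jsd                # 1.0 = identical distributions
print(round(trip_count_score, 3))
```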
[12] Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry
Aya E. Fouda, Abdelrahamn A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda
Main category: cs.CL
TL;DR: PsychiatryBench is a new benchmark for evaluating LLMs in psychiatry using expert-validated textbook content, revealing significant gaps in clinical consistency and safety across 11 tasks and 5,300+ items.
Details
Motivation: Existing LLM evaluation resources for psychiatry rely on limited clinical data, social media posts, or synthetic dialogues, which lack clinical validity and fail to capture the complexity of psychiatric reasoning.
Method: Created PsychiatryBench from authoritative psychiatric textbooks and casebooks, comprising 11 QA tasks (diagnostic reasoning, treatment planning, follow-up, etc.) with over 5,300 expert-annotated items. Evaluated frontier LLMs using conventional metrics and LLM-as-judge similarity scoring.
Result: Substantial gaps in clinical consistency and safety were found, particularly in multi-turn follow-up and management tasks, highlighting the need for specialized model tuning.
Conclusion: PsychiatryBench provides a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.
Abstract: Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBioLLM, MedGemma) using both conventional metrics and an “LLM-as-judge” similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.
[13] The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization
Talha Tahir
Main category: cs.CL
TL;DR: ORPO training significantly outperforms SFT and base models in teaching small LLMs to deliver ACT therapy, with chain-of-thought reasoning helping SFT but not ORPO models.
Details
Motivation: To investigate how different post-training methods (SFT vs ORPO) and explicit reasoning (chain-of-thought) affect small LLMs' ability to deliver Acceptance and Commitment Therapy effectively.
Method: Trained Llama-3.2-3b-Instruct on 50 synthetic ACT transcripts with SFT and ORPO, each with and without chain-of-thought reasoning. Evaluated models on the ACT Fidelity Measure and Therapist Empathy Scale via an LLM judge fine-tuned on human evaluations.
Result: ORPO-trained models significantly outperformed SFT and base models on both ACT fidelity (χ²=185.15, p<.001) and therapeutic empathy (χ²=140.37, p<.001). COT helped SFT models (improved ACT-FM by 2.68 points) but provided no advantage to ORPO models.
Conclusion: Preference-aligned policy optimization (ORPO) effectively teaches ACT competencies to small LLMs by focusing on therapeutic process rather than content imitation. The utility of explicit reasoning depends on the training paradigm.
Abstract: Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral
therapy with emerging evidence of efficacy in several psychiatric conditions.
This study investigates the impact of post-training methodology and explicit
reasoning on the ability of a small open-weight large language model (LLM) to
deliver ACT. Using 50 sets of synthetic ACT transcripts generated by
Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches,
supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each
with and without an explicit chain-of-thought (COT) reasoning step. Performance
was evaluated by comparing these four post-trained variants against the base
Instruct model. These models were benchmarked in simulated therapy sessions,
with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM)
and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned
on human evaluations. Our findings demonstrate that the ORPO-trained models
significantly outperformed both their SFT and Instruct counterparts on ACT
fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) =
140.37, p < .001$). The effect of COT was conditional as it provided a
significant benefit to SFT models, improving ACT-FM scores by an average of
2.68 points ($p < .001$), while offering no discernible advantage to the
superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO
stems from its ability to learn the therapeutic process' over imitating
content,’ a key aspect of ACT, while COT acts as a necessary scaffold for
models trained only via imitation. This study establishes that
preference-aligned policy optimization can effectively instill ACT competencies
in small LLMs, and that the utility of explicit reasoning is highly dependent
on the underlying training paradigm.
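For readers unfamiliar with ORPO, the sketch below follows the published ORPO objective (an SFT loss plus a log-odds-ratio penalty on preferred vs. rejected responses); it illustrates the loss form only, not this paper's training code:

```python
# Hedged sketch of the ORPO objective: standard SFT loss on the chosen
# response plus an odds-ratio term pushing the preferred (here, ACT-
# consistent) response's odds above the rejected one's.
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, lam=0.1):
    # logp_*: length-normalized log-probabilities of each response, log P(y|x)
    def log_odds(logp):
        return logp - torch.log1p(-torch.exp(logp))  # log[P / (1 - P)]
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    loss_or = -F.logsigmoid(ratio).mean()
    return nll_chosen.mean() + lam * loss_or

# Toy values: the chosen response is more probable than the rejected one.
logp_c = torch.tensor([-0.9]); logp_r = torch.tensor([-1.6])
print(orpo_loss(logp_c, logp_r, nll_chosen=-logp_c))
```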
[14] HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering
Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu
Main category: cs.CL
TL;DR: HANRAG is a heuristic-based framework that improves RAG systems by routing queries, decomposing them into sub-queries, and filtering noise from retrieved documents to better handle multi-hop question-answering tasks.
Details
Motivation: Current RAG methods struggle with multi-hop queries due to inefficient iterative retrieval, failure to capture relevant sub-query content, and noise accumulation in retrieved documents.
Method: The HANRAG framework uses a powerful revelator to route queries, decompose complex queries into sub-queries, and filter noise from retrieved documents, enhancing adaptability and noise resistance.
Result: HANRAG demonstrates superior performance compared to other leading industry methods across various benchmarks, excelling in both single-hop and multi-hop question-answering tasks.
Conclusion: The proposed HANRAG framework effectively addresses the limitations of current RAG approaches by providing efficient query decomposition and noise filtering, making it highly capable of handling diverse query complexities.
Abstract: The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system’s adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.
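Since HANRAG's components are described only at a high level here, the following is a schematic control-flow sketch with placeholder functions; every name and body below is hypothetical:

```python
# Schematic sketch of the HANRAG control flow as described: a "revelator"
# routes the query, decomposes multi-hop questions into sub-queries, and
# filters noisy passages before answer generation. All bodies are placeholders.
def revelator_route(query: str) -> str:
    """Classify the query, e.g. 'single-hop' vs. 'multi-hop' (placeholder)."""
    ...

def decompose(query: str) -> list[str]:
    """Split a compound/multi-hop query into sub-queries (placeholder)."""
    ...

def retrieve(sub_query): ...
def is_relevant(sub_query, doc) -> bool: ...   # noise filter (placeholder)
def generate(query, evidence): ...

def hanrag_answer(query: str):
    sub_queries = [query] if revelator_route(query) == "single-hop" \
        else decompose(query)
    evidence = []
    for sq in sub_queries:
        # Retrieve per sub-query, keeping only passages judged relevant,
        # so noise does not accumulate across hops.
        evidence += [d for d in retrieve(sq) if is_relevant(sq, d)]
    return generate(query, evidence)
```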
[15] Prominence-aware automatic speech recognition for conversational speech
Julian Linke, Barbara Schuppler
Main category: cs.CL
TL;DR: Combined prominence detection and speech recognition for Austrian German using wav2vec2 models, achieving 85.53% prominence accuracy when words were correctly recognized, without improving baseline ASR performance.
Details
Motivation: To develop prominence-aware automatic speech recognition systems that can simultaneously transcribe words and detect prosodic prominence levels for conversational Austrian German, with applications in linguistic research and prosody-informed dialogue systems.
Method: Fine-tuned wav2vec2 models for word-level prominence detection, used the detector to automatically annotate prominence in a large corpus, and trained novel prominence-aware ASR systems that transcribe both words and prominence levels simultaneously.
Result: The integration of prominence information did not improve performance compared to the baseline ASR system, but the system achieved 85.53% prominence detection accuracy for utterances where the recognized word sequence was correct.
Conclusion: Transformer-based models can effectively encode prosodic information, representing a novel contribution to prosody-enhanced ASR with potential applications in linguistic research and prosody-informed dialogue systems.
Abstract: This paper investigates prominence-aware automatic speech recognition (ASR) by combining prominence detection and speech recognition for conversational Austrian German. First, prominence detectors were developed by fine-tuning wav2vec2 models to classify word-level prominence. The detector was then used to automatically annotate prosodic prominence in a large corpus. Based on those annotations, we trained novel prominence-aware ASR systems that simultaneously transcribe words and their prominence levels. The integration of prominence information did not change performance compared to our baseline ASR system, while reaching a prominence detection accuracy of 85.53% for utterances where the recognized word sequence was correct. This paper shows that transformer-based models can effectively encode prosodic information and represents a novel contribution to prosody-enhanced ASR, with potential applications for linguistic research and prosody-informed dialogue systems.
[16] How Small Transformations Expose the Weakness of Semantic Similarity Measures
Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Aziz Bonkoungou, Micheline Bénédicte Moumoula, Jordan Samhi, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande
Main category: cs.CL
TL;DR: Study evaluates 18 semantic similarity methods for software engineering tasks, finding significant flaws in embedding-based approaches, which often rate semantic opposites as similar, while LLM-based methods perform better at distinguishing true semantic differences.
Details
Motivation: To assess the reliability of semantic similarity measurement methods used in software engineering applications like code search and API recommendations, particularly questioning whether LLMs truly understand semantics or just recognize surface patterns.
Method: Created a systematic testing framework with controlled text and code changes to evaluate 18 different approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms.
Result: Embedding-based methods incorrectly identified semantic opposites as similar up to 99.9% of the time, while switching from Euclidean to cosine similarity improved results by 24-66%. LLM-based approaches performed better, producing low similarity scores (0.00-0.29) for different meanings vs embedding methods’ incorrect high scores (0.82-0.99).
Conclusion: Common semantic similarity metrics have significant flaws, with embedding methods particularly problematic for confusing opposites. Distance calculation method choice (cosine vs Euclidean) dramatically impacts performance, and LLM-based approaches show better semantic discrimination capabilities.
Abstract: This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns. The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships. The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods’ poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content.
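The distance-metric effect reported above is easy to demonstrate on toy vectors: Euclidean distance conflates magnitude with meaning, while cosine similarity compares direction only, cleanly separating a scaled paraphrase from an opposite:

```python
# Toy demonstration of cosine vs. Euclidean on unnormalized embeddings.
# Vectors are made-up values, not real model embeddings.
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_sim(a, b):
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))  # map distance to (0, 1]

a = np.array([1.0, 2.0, 0.5])      # some text/code snippet
b = np.array([2.0, 4.0, 1.0])      # same direction, larger magnitude (paraphrase)
c = np.array([-1.0, -2.0, -0.5])   # opposite direction (negated meaning)

print(cosine_sim(a, b), cosine_sim(a, c))        # 1.0 vs. -1.0: clear contrast
print(euclidean_sim(a, b), euclidean_sim(a, c))  # ~0.30 vs. ~0.18: contrast nearly lost
```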
[17] Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA
Naveen Lamba, Sanju Tiwari, Manas Gaur
Main category: cs.CL
TL;DR: This paper identifies and characterizes key symbolic properties that make LLMs intrinsically vulnerable to hallucinations, showing that modifiers and named entities remain problematic across model scales.
Details
Motivation: To understand the fundamental properties that make Large Language Models intrinsically vulnerable to hallucinations, as this problem is well-studied but its root causes have not been properly identified.
Method: Used the HaluEval and TruthfulQA datasets, converting their question-answering format into various other formats to isolate and test specific symbolic properties as causes of hallucinations across different Gemma model sizes (2B, 9B, 27B).
Result: Hallucination percentages were high across symbolic properties: 79.0% for Gemma-2-2B, dropping to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B. Modifiers (84.76%-94.98%) and named entities (83.87%-93.96%) showed particularly high hallucination rates across all models.
Conclusion: Symbolic elements like modifiers and named entities continue to confuse LLMs regardless of model scale, indicating a fundamental weakness in how these models process such inputs, with hallucinations persisting even as model size increases.
Abstract: Hallucination in Large Language Models (LLMs) is a well studied problem. However, the properties that make LLMs intrinsically vulnerable to hallucinations have not been identified and studied. This research identifies and characterizes the key properties, allowing us to pinpoint vulnerabilities within the model’s internal mechanisms. To solidify on these properties, we utilized two established datasets, HaluEval and TruthfulQA, and converted their existing format of question answering into various other formats to narrow down these properties as the reason for the hallucinations. Our findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets. With increased model scale, hallucination drops to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B, reflecting a 15 percentage point reduction overall. Although the hallucination rate decreases as the model size increases, a substantial amount of hallucination caused by symbolic properties still persists. This is especially evident for modifiers (ranging from 84.76% to 94.98%) and named entities (ranging from 83.87% to 93.96%) across all Gemma models and both datasets. These findings indicate that symbolic elements continue to confuse the models, pointing to a fundamental weakness in how these LLMs process such inputs, regardless of their scale.
[18] ALIGNS: Unlocking nomological networks in psychological measurement through a large language model
Kai R. Larsen, Sen Yan, Roland Müller, Lan Sang, Mikko Rönkkö, Ravi Starzl, Donald Edmondson
Main category: cs.CL
TL;DR: ALIGNS is an LLM-based system that generates comprehensive nomological networks using validated questionnaire measures to address fundamental challenges in psychological measurement validation.
Details
Motivation: Building nomological networks has remained challenging for 70 years since Cronbach and Meehl proposed them, with practical consequences such as failed clinical trials and public policy targeting the wrong outcomes.
Method: ALIGNS uses large language models trained with validated questionnaire measures to create three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields.
Result: The system demonstrated: 1) NIH PROMIS anxiety and depression instruments converge into a single emotional distress dimension, 2) child temperament measures reveal four new potential dimensions and question one existing dimension, 3) expert psychometricians validated its importance, accessibility, and suitability.
Conclusion: ALIGNS represents the first application of LLMs to solve foundational measurement validation problems, complementing traditional methods with large-scale nomological analysis and is freely available for use.
Abstract: Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system’s importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.
[19] DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model
Wonyoung Kim, Sujeong Seo, Juhyun Lee
Main category: cs.CL
TL;DR: A framework using temporal patent analysis and LLMs to identify emerging tech opportunities, tested on AI patents showing evolution toward everyday accessibility.
Details
Motivation: Technology opportunities are crucial for advancement but difficult to identify systematically. The paper aims to develop a structured approach to detecting emerging technology trends from patent data.
Method: Extracts text from patents, maps text-based topics to discover inter-technology relationships, tracks topic changes over time, and uses LLMs with chat prompts to make opportunity discovery more efficient.
Result: Framework successfully identified that AI technology is evolving toward forms that facilitate everyday accessibility when tested on USPTO AI patent dataset.
Conclusion: The proposed framework demonstrates strong potential for identifying future technology opportunities through temporal patent analysis and LLM integration.
Abstract: Technology opportunities are critical information that serve as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.
[20] BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025
Chunyu Li, Xindi Zheng, Siqi Liu
Main category: cs.CL
TL;DR: A lightweight pipeline for multilingual biomedical nested entity linking that maintains the original EL model while modifying three key components: two-stage retrieval-ranking, boundary cues with learnable tags, and dataset augmentation.
Details
Motivation: Entity linking for biomedical text has been primarily benchmarked on English-only corpora with flat mentions, leaving the realistic scenario of nested and multilingual mentions largely unexplored.
Method: A two-stage retrieval-ranking system using the same base encoder model (the original pre-trained model for retrieval, a domain-specific fine-tuned version for ranking), with learnable [Ms]/[Me] boundary tags for language-agnostic span detection, and automatic dataset augmentation from three complementary sources.
Result: The system (BIBERT-Pipe) ranked third in the BioNNE 2025 multilingual track leaderboard, demonstrating effectiveness and competitiveness of minimal yet principled modifications.
Conclusion: The proposed lightweight pipeline successfully addresses the challenges of multilingual biomedical nested entity linking through targeted modifications while keeping the core EL model intact, showing strong performance on the benchmark task.
Abstract: Entity linking (EL) for biomedical text is typically benchmarked on English-only corpora with flat mentions, leaving the more realistic scenario of nested and multilingual mentions largely unexplored. We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English & Russian), closing this gap with a lightweight pipeline that keeps the original EL model intact and modifies only three task-aligned components: Two-stage retrieval-ranking. We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning. Boundary cues. In the ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing the encoder with an explicit, language-agnostic span signal that improves robustness to overlap and nesting. Dataset augmentation. We also automatically expand the ranking training corpus with three complementary data sources, enhancing coverage without extra manual annotation. On the BioNNE 2025 leaderboard, our two-stage system, bilingual BERT (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications. Code is publicly available at https://github.com/Kaggle-Competitions-Code/BioNNE-L.
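The boundary-cue idea can be illustrated with standard Hugging Face APIs: register [Ms]/[Me] as special tokens and resize the embedding matrix so their vectors are learned during fine-tuning. The base model chosen here is an assumption, not the system's actual encoder:

```python
# Illustration of the boundary-cue idea: wrap each mention in learnable
# [Ms]/[Me] marker tokens so the encoder sees an explicit, language-agnostic
# span that also survives nesting and overlap.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# Register the markers as special tokens and grow the embedding matrix so
# their vectors can be learned during ranking-stage fine-tuning.
tokenizer.add_special_tokens({"additional_special_tokens": ["[Ms]", "[Me]"]})
model.resize_token_embeddings(len(tokenizer))

text = "Patients with [Ms] type 2 diabetes mellitus [Me] were enrolled."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # the mention span is now explicit to the encoder
```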
[21] Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure
Seiji Hattori, Takuya Matsuzaki, Makoto Fujiwara
Main category: cs.CL
TL;DR: LLM-based method for translating formal proofs to natural language using informalization and summarization, evaluated on textbook proofs and Lean library with high readability and accuracy.
Details
Motivation: To bridge the gap between machine-verifiable formal proofs and human-readable natural language proofs by leveraging LLM capabilities for better accessibility and understanding.
Method: Uses LLMs for informalization (verbalizing formal proof steps) and summarization to translate formal proofs into natural language, evaluated on undergraduate textbook proofs and the Lean proof assistant library.
Result: Generated natural language proofs show high readability and accuracy compared to original textbook proofs, successfully applied to existing Lean formal proof library.
Conclusion: The proposed method effectively produces readable and accurate natural language translations of formal proofs, demonstrating practical utility for making formal proofs more accessible.
Abstract: This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we demonstrate that this method can output highly readable and accurate natural language proofs by applying it to the existing formal proof library of the Lean proof assistant.
[22] A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs
Andy Zhu, Yingjun Du
Main category: cs.CL
TL;DR: Multi-agent framework with role-based prompting improves financial QA accuracy by 6.6-8.3% over zero-shot baselines, enabling smaller models to match specialized finance-tuned models.
Details
Motivation: Existing LLMs fail to capture the nuanced financial reasoning required in financial education QA, which demands multistep quantitative analysis, domain terminology, and real-world scenario comprehension.
Method: A three-agent framework (Base Generator, Evidence Retriever, and Expert Reviewer) using RAG over finance textbooks and role-based prompting in a single-pass iteration.
Result: Critique-based refinement improved accuracy by 6.6-8.3% over zero-shot Chain-of-Thought, with Gemini-2.0-Flash performing best. GPT-4o-mini achieved comparable performance to finance-tuned FinGPT-mt_Llama3-8B_LoRA.
Conclusion: Cost-effective multi-agent approach enhances financial QA performance and provides insights for future research in multi-agent financial LLM systems.
Abstract: Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from Study.com, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems.
[23] A meta-analysis on the performance of machine-learning based language models for sentiment analysis
Elena Rohde, Jonas Klingwort, Christian Borgs
Main category: cs.CL
TL;DR: Meta-analysis of 195 trials from 20 studies shows ML models achieve 80% average accuracy in Twitter sentiment analysis, but highlights problems with using overall accuracy as a performance metric and calls for standardized reporting practices.
Details
Motivation: To evaluate the average performance of machine learning models in Twitter sentiment analysis, assess heterogeneity between studies, and understand how study characteristics influence model performance.
Method: Used PRISMA guidelines to search academic databases, selected 195 trials from 20 studies with 12 features. Analyzed overall accuracy using double arcsine transformation and three-level random effects model with AIC optimization.
Result: The average overall accuracy of the optimized model was 0.80 [0.76, 0.84]. Found that overall accuracy is misleading due to sensitivity to class imbalance and number of sentiment classes.
Conclusion: Overall accuracy should be normalized for reliable comparisons, and standardized reporting including confusion matrices for independent test sets is essential but not commonly practiced in current research.
Abstract: This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Using PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies with 12 study features. Overall accuracy, the most reported performance metric, was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization. 2) Standardized reporting of model performance, including reporting confusion matrices for independent test sets, is essential for reliable comparisons of ML classifiers across studies, which seems far from common practice.
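The double arcsine step is the Freeman-Tukey variance-stabilizing transform for proportions. A small sketch of how each trial's reported accuracy would be transformed before pooling (the random-effects pooling itself is omitted):

```python
import numpy as np

def double_arcsine(successes, n):
    """Freeman-Tukey double arcsine transform of a proportion.

    Stabilizes the variance of accuracy-like proportions before pooling
    them in a random-effects meta-analysis (variance ~ 1 / (n + 0.5)).
    """
    x = np.asarray(successes, dtype=float)
    n = np.asarray(n, dtype=float)
    t = np.arcsin(np.sqrt(x / (n + 1))) + np.arcsin(np.sqrt((x + 1) / (n + 1)))
    var = 1.0 / (n + 0.5)
    return t, var

# Example: a trial reporting 800 correct labels out of 1,000 test tweets.
t, v = double_arcsine(800, 1000)
```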
[24] Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
Main category: cs.CL
TL;DR: AncientDoc is the first benchmark for evaluating Vision-Language Models on Chinese ancient documents, featuring five tasks across diverse document types to assess capabilities from OCR to knowledge reasoning.
Details
Motivation: Chinese ancient documents contain valuable historical and cultural knowledge but face digitization challenges. Current VLMs struggle with their visual and linguistic complexity, and existing benchmarks focus only on English or simplified Chinese, leaving a gap for ancient Chinese document evaluation.
Method: Created AncientDoc benchmark with five tasks: page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, and linguistic variant QA. The dataset covers 14 document types, over 100 books, and about 3,000 pages. Evaluated mainstream VLMs using multiple metrics with human-aligned LLM scoring.
Result: The abstract reports no specific scores for the VLMs evaluated; the stated result is the benchmark itself, which is complete and ready for model evaluation.
Conclusion: AncientDoc fills a critical gap in evaluating VLMs for Chinese ancient document understanding, providing a comprehensive benchmark that addresses the unique challenges of these culturally significant documents through multiple task types and extensive document coverage.
Abstract: Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.
[25] MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools
Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, Zhendong Mao
Main category: cs.CL
TL;DR: MCP-AgentBench is a new benchmark designed to evaluate language agent capabilities in MCP-mediated tool interactions, addressing the gap in existing benchmarks that fail to capture real-world agent performance in this emerging paradigm.
Details
Motivation: Existing benchmarks fail to properly evaluate agent performance in the Model Context Protocol (MCP) ecosystem, leading to distorted perceptions of agent capabilities and inability to differentiate proficiencies in real-world MCP tool interactions.
Method: Developed a comprehensive benchmark with: 1) MCP testbed with 33 operational servers and 188 distinct tools, 2) 600 systematically designed queries across 6 categories of varying complexity, 3) MCP-Eval outcome-oriented evaluation methodology prioritizing task success.
Result: The benchmark provides foundational insights through extensive empirical evaluation of leading language agents, enabling reliable assessment of agent capabilities in MCP environments.
Conclusion: MCP-AgentBench provides a standardized framework for building, validating, and advancing agents that can fully leverage MCP’s benefits, accelerating progress toward capable and interoperable AI systems.
Abstract: The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP’s growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench – a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP’s transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.
[26] Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation
Willem Huijzer, Jieying Chen
Main category: cs.CL
TL;DR: LLMs show significant biases in decision-making tasks favoring female gender, younger ages, and African-American backgrounds, while summarization shows minimal bias. Cross-lingual analysis reveals similar bias patterns between English and Dutch. Prompt-based mitigation strategies can reduce biases by up to 27%, with GPT-4o showing better bias reduction than GPT-3.5.
Details
Motivation: To address concerns about societal inequalities and information bias as LLMs are rapidly integrated into various domains, by examining biases related to background, gender, and age in decision-making and summarization tasks, and evaluating mitigation strategies.
Method: Used an adapted dataset translated into Dutch to create 151,200 prompts for decision tasks and 176,400 for summarization tasks. Tested various demographic variables, instructions, salience levels, and languages on GPT-3.5 and GPT-4o models.
Result: Both models showed significant bias in decision-making (favoring female gender, younger ages, African-American background), while summarization showed minimal bias. Cross-lingual patterns were similar between English and Dutch. Prompt-based mitigation achieved 27% mean reduction in bias gaps, with GPT-4o showing better bias reduction capabilities.
Conclusion: Cautious adoption of LLMs and context-specific bias testing are crucial. Continued development of effective mitigation strategies is needed for responsible AI deployment, with newer models like GPT-4o showing promise for prompt-based bias reduction.
Abstract: The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. This study examines biases in LLMs related to background, gender, and age, with a focus on their impact on decision-making and summarization tasks. Additionally, the research examines the cross-lingual propagation of these biases and evaluates the effectiveness of prompt-instructed mitigation strategies. Using an adapted version of the dataset by Tamkin et al. (2023) translated into Dutch, we created 151,200 unique prompts for the decision task and 176,400 for the summarisation task. Various demographic variables, instructions, salience levels, and languages were tested on GPT-3.5 and GPT-4o. Our analysis revealed that both models were significantly biased during decision-making, favouring female gender, younger ages, and certain backgrounds such as the African-American background. In contrast, the summarisation task showed minimal evidence of bias, though significant age-related differences emerged for GPT-3.5 in English. Cross-lingual analysis showed that bias patterns were broadly similar between English and Dutch, though notable differences were observed across specific demographic categories. The newly proposed mitigation instructions, while unable to eliminate biases completely, demonstrated potential in reducing them. The most effective instruction achieved a 27% mean reduction in the gap between the most and least favorable demographics. Notably, contrary to GPT-3.5, GPT-4o displayed reduced biases for all prompts in English, indicating the specific potential for prompt-based mitigation within newer models. This research underscores the importance of cautious adoption of LLMs and context-specific bias testing, highlighting the need for continued development of effective mitigation strategies to ensure responsible deployment of AI.
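To illustrate how a "gap between the most and least favorable demographics" and its percent reduction can be computed, here is a toy example with made-up rates; the numbers are not from the study.

```python
import numpy as np

def favorability_gap(rates):
    """Gap between the most- and least-favored demographic groups.

    `rates` maps group name -> mean rate of favorable decisions.
    """
    vals = np.array(list(rates.values()))
    return vals.max() - vals.min()

# Hypothetical rates before and after a mitigation instruction.
before = {"group_a": 0.62, "group_b": 0.48, "group_c": 0.55}
after = {"group_a": 0.58, "group_b": 0.50, "group_c": 0.54}
reduction = 1 - favorability_gap(after) / favorability_gap(before)
print(f"gap reduction: {reduction:.0%}")  # ~43% in this toy example
```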
[27] HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning
Brennen Hill
Main category: cs.CL
TL;DR: HEFT combines LoRA (weight space) and ReFT (representation space) PEFT methods in a hierarchical approach, achieving 85.17% accuracy on BoolQ with just 3 epochs, outperforming individual methods trained for 20 epochs.
Details
Motivation: Overcoming computational constraints in adapting LLMs to specialized reasoning tasks by synergistically combining different PEFT paradigms for superior efficiency and performance.
Method: Hierarchical Efficient Fine-Tuning (HEFT) first applies broad adaptation with LoRA in weight space, then precise refinement with Representation Fine-Tuning (ReFT) on internal activations.
Result: 85.17% accuracy on BoolQ benchmark with only 3 epochs, surpassing LoRA-only (85.05%) and ReFT-only (83.36%) methods trained for 20 epochs.
Conclusion: Thoughtful composition of PEFT methods offers an efficient path to enhance LLM reasoning capabilities, demonstrating synergistic effects that overcome computational barriers for complex cognitive tasks.
Abstract: The adaptation of large language models (LLMs) to specialized reasoning tasks is fundamentally constrained by computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the landscape of these techniques is diverse, with distinct methods operating in either the model’s weight space or its representation space. This paper investigates the hypothesis that a synergistic combination of these paradigms can unlock superior performance and efficiency. We introduce HEFT (Hierarchical Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes two distinct PEFT methods in a coarse-to-fine manner: first, a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise, surgical refinement of internal activations using Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential reasoning. Our results reveal a profound synergistic effect. A model fine-tuned for only three epochs with our HEFT strategy achieves an accuracy of 85.17%, exceeding the performance of models trained for 20 epochs with either LoRA-only (85.05%) or ReFT-only (83.36%) methodologies. This work demonstrates that the thoughtful composition of PEFT methods is a potent algorithmic innovation, offering a more efficient and effective path toward advancing the reasoning capabilities of language models. By achieving superior results with a fraction of the computational budget, our findings present a principled approach to overcoming the obstacles inherent in adapting large-scale models for complex cognitive tasks.
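A hedged sketch of the coarse-to-fine idea: stage 1 uses the Hugging Face peft library for LoRA, and stage 2 approximates ReFT with a hand-rolled low-rank intervention on one layer's hidden states via a forward hook. The target layer, rank, and intervention form are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# --- Stage 1: coarse adaptation in weight space with LoRA ---
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
# ... train on BoolQ, then fold the LoRA weights into the base model ...
model = model.merge_and_unload()

# --- Stage 2: fine refinement in representation space (ReFT-style) ---
class LowRankIntervention(nn.Module):
    """Additive low-rank edit of hidden states: h <- h + B(A h)."""
    def __init__(self, hidden, rank=4):
        super().__init__()
        self.A = nn.Linear(hidden, rank, bias=False)
        self.B = nn.Linear(rank, hidden, bias=False)
        nn.init.zeros_(self.B.weight)  # start as the identity mapping

    def forward(self, h):
        return h + self.B(self.A(h))

intervention = LowRankIntervention(model.config.hidden_size)

def hook(_module, _inputs, output):
    # Decoder layers return a tuple; edit only the hidden states.
    return (intervention(output[0]),) + output[1:]

model.model.layers[15].register_forward_hook(hook)  # mid-depth layer (a choice)

# Freeze everything except the intervention, then fine-tune briefly.
for p in model.parameters():
    p.requires_grad = False
for p in intervention.parameters():
    p.requires_grad = True
```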
[28] Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization
Helen de Andrade Abreu, Tiago Timponi Torrent, Ely Edison da Silva Matos
Main category: cs.CL
TL;DR: A framework for modeling multimodal conversational turn organization through correlations between language and interactive gestures, with evidence from enriched annotations of pragmatic frames in a multimodal dataset.
Details
Motivation: To fill the gap in machine learning datasets by encoding specific strategies and gestures used for conversational turn organization, which had not been previously documented in available datasets.
Method: Developed an annotation methodology to enrich the Frame2 dataset (Brazilian TV series episodes) with pragmatic frames modeling turn organization, specifically annotating gestures used for passing, taking, and keeping conversational turns.
Result: Confirmed that communicators use gestures as tools for turn management in face-to-face conversation and revealed previously undocumented gesture variations. Demonstrated that pragmatic frame annotation contributes to understanding human cognition and language.
Conclusion: The study provides evidence that gestures arise from conceptualization of pragmatic frames involving mental spaces, blending and conceptual metaphors, and that multimodal annotation enhances understanding of conversational dynamics and human cognition.
Abstract: This paper proposes a framework for modeling multimodal conversational turn organization via the proposition of correlations between language and interactive gestures, based on analysis as to how pragmatic frames are conceptualized and evoked by communicators. As a means to provide evidence for the analysis, we developed an annotation methodology to enrich a multimodal dataset (annotated for semantic frames) with pragmatic frames modeling conversational turn organization. Although conversational turn organization has been studied by researchers from diverse fields, the specific strategies, especially gestures used by communicators, had not yet been encoded in a dataset that can be used for machine learning. To fill this gap, we enriched the Frame2 dataset with annotations of gestures used for turn organization. The Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo Mundo annotated for semantic frames evoked in both video and text. This dataset allowed us to closely observe how communicators use interactive gestures outside a laboratory, in settings, to our knowledge, not previously recorded in related literature. Our results have confirmed that communicators involved in face-to-face conversation make use of gestures as a tool for passing, taking and keeping conversational turns, and also revealed variations of some gestures that had not been documented before. We propose that the use of these gestures arises from the conceptualization of pragmatic frames, involving mental spaces, blending and conceptual metaphors. In addition, our data demonstrate that the annotation of pragmatic frames contributes to a deeper understanding of human cognition and language.
[29] Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization
Chuyuan Li, Austin Xu, Shafiq Joty, Giuseppe Carenini
Main category: cs.CL
TL;DR: Topic-guided reinforcement learning approach improves multi-document summarization by using topic rewards to enhance content selection and alignment with source documents.
Details
Motivation: Large Language Models perform well on single-document summarization but still have room for improvement in Multi-Document Summarization (MDS), particularly in effectively integrating information from multiple sources while maintaining coherence and topical relevance.
Method: Propose a topic-guided reinforcement learning approach using topic rewards within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between generated summaries and source documents. First demonstrate that explicit topic prompting enhances informativeness.
Result: Experimental results on Multi-News and Multi-XScience datasets show the method consistently outperforms strong baselines.
Conclusion: Leveraging topical cues through reinforcement learning is an effective approach for improving content selection and overall performance in multi-document summarization.
Abstract: A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.
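The topic reward can be pictured as a similarity between topic distributions of the summary and the sources. A minimal sketch, assuming some topic model supplies the distributions; the paper's exact reward definition may differ.

```python
import numpy as np

def topic_reward(summary_topics, source_topics, eps=1e-8):
    """Cosine similarity between topic distributions of a generated summary
    and its source documents (probability vectors over a shared topic set)."""
    s = np.asarray(summary_topics, dtype=float)
    d = np.asarray(source_topics, dtype=float)
    return float(s @ d / (np.linalg.norm(s) * np.linalg.norm(d) + eps))

# In GRPO, each sampled summary in a group receives this reward (possibly
# combined with other rewards); advantages are computed against the group mean.
```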
[30] Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case
Bastián González-Bustamante, Nando Verelst, Carla Cisternas
Main category: cs.CL
TL;DR: LLMs can generate synthetic survey responses that approximate human responses from probabilistic surveys, with excellent performance on trust items and comparable results across top models like GPT-4o and Llama 4 Maverick, though performance varies by item and demographic factors.
Details
Motivation: To evaluate whether LLM-generated synthetic survey responses can reliably emulate human answers and behavior while mitigating measurement and representation errors in survey research, while also assessing risks of reproducing social biases.
Method: Benchmarked 128 prompt-model-question triplets generating 189,696 synthetic profiles, compared against ground-truth human responses from a Chilean public opinion probabilistic survey. Used meta-analysis across 128 question-subsample pairs with performance metrics (accuracy, precision, recall, F1-score) to test biases along sociodemographic dimensions.
Result: Synthetic responses achieved excellent performance on trust items (F1-score and accuracy > 0.90). GPT-4o, GPT-4o-mini and Llama 4 Maverick performed comparably. Synthetic-human alignment was highest among respondents aged 45-59. Overall good approximation but with substantial item-level heterogeneity.
Conclusion: LLM-based synthetic samples can approximate probabilistic sample responses but require careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors, as capturing full nuance of public opinion remains challenging.
Abstract: Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI’s GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors.
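The pooled metrics start from per-pair scores. A small sketch of the per-question-subsample evaluation using scikit-learn; the meta-analytic pooling step is only indicated in a comment.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def pair_metrics(human, synthetic):
    """Score synthetic responses against human ground truth for one
    question-subsample pair (macro-averaged over answer options)."""
    acc = accuracy_score(human, synthetic)
    p, r, f1, _ = precision_recall_fscore_support(
        human, synthetic, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}

# Toy pair: human vs. synthetic answers on a trust item.
print(pair_metrics(["agree", "agree", "neutral"],
                   ["agree", "neutral", "neutral"]))
# The 128 per-pair results would then be pooled in a meta-analysis.
```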
[31] Large Language Models Meet Legal Artificial Intelligence: A Survey
Zhitian Hou, Zihan Ye, Nanli Zeng, Tianyong Hao, Kun Zeng
Main category: cs.CL
TL;DR: Comprehensive review of 16 legal LLM series, 47 LLM-based legal frameworks, 15 benchmarks, and 29 datasets for evaluating legal AI capabilities, with analysis of challenges and future directions.
Details
Motivation: To advance research and applications of LLM-based approaches in the legal domain by providing systematic resources and guidance for beginners and researchers.
Method: Systematic review and compilation of existing legal LLMs, frameworks, benchmarks, and datasets, followed by analysis of challenges and future research directions.
Result: Created a comprehensive resource repository including 16 legal LLM series, 47 frameworks, 15 benchmarks, and 29 datasets for legal AI evaluation and development.
Conclusion: This paper provides foundational resources and guidance to accelerate LLM-based legal AI research, highlighting current challenges and future opportunities in the field.
Abstract: Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in the legal domain, this paper provides a comprehensive review of 16 legal LLM series and 47 LLM-based frameworks for legal tasks, and also gathers 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at https://github.com/ZhitianHou/LLMs4LegalAI.
[32] CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
Main category: cs.CL
TL;DR: New CMHG dataset created for Chinese minority languages (Tibetan, Uyghur, Mongolian) to address headline generation challenges due to unique writing systems and lack of corpora.
Details
Motivation: Minority languages in China face significant challenges due to unique writing systems that differ from international standards, leading to severe lack of relevant corpora for supervised tasks like headline generation.
Method: Introduce Chinese Minority Headline Generation (CMHG) dataset with 100,000 Tibetan entries and 50,000 each for Uyghur and Mongolian, plus a high-quality native-speaker annotated test set as benchmark.
Result: Creation of a comprehensive dataset specifically curated for headline generation tasks in Tibetan, Uyghur, and Mongolian languages.
Conclusion: The CMHG dataset serves as a valuable resource for advancing headline generation in Chinese minority languages and contributes to developing related benchmarks for future research.
Abstract: Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.
[33] Unsupervised Hallucination Detection by Inspecting Reasoning Processes
Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu
Main category: cs.CL
TL;DR: IRIS is an unsupervised hallucination detection framework that uses LLM’s internal representations and response uncertainty to identify factual errors without labeled data, outperforming existing methods.
Details
Motivation: Existing unsupervised hallucination detection methods rely on proxy signals unrelated to factual correctness, limiting their generalizability across datasets and scenarios.
Method: IRIS prompts LLMs to verify statement truthfulness, extracts contextualized embeddings as features, and uses response uncertainty as soft pseudolabels for training.
Result: IRIS consistently outperforms existing unsupervised methods, is computationally low cost, and works well even with limited training data.
Conclusion: The proposed IRIS framework provides an effective unsupervised approach for real-time hallucination detection by leveraging intrinsic factual correctness signals from LLMs.
Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement and obtains its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with little training data, making it suitable for real-time detection.
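The core training loop reduces to a probe over verification-time embeddings with uncertainty as a soft target. A minimal PyTorch sketch with placeholder features and pseudolabels; the feature-extraction and uncertainty-estimation steps are assumed, not shown.

```python
import torch
import torch.nn as nn

# Placeholders: in IRIS these come from hidden states captured while the
# LLM verifies each statement, and from per-response uncertainty.
embeddings = torch.randn(256, 4096)
soft_labels = torch.rand(256)

probe = nn.Linear(embeddings.size(-1), 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # accepts soft targets in [0, 1]

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(embeddings).squeeze(-1), soft_labels)
    loss.backward()
    opt.step()
```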
[34] Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs
Adnan Ahmad, Philine Kowol, Stefan Hillmann, Sebastian Möller
Main category: cs.CL
TL;DR: Open-source LLMs (LLama2-7B, Mistral-7B, Yi-6B) evaluated on multi-label intent classification using MultiWOZ 2.1 dataset. Mistral-7B performed best in few-shot setting, but BERT-based supervised learning outperformed all LLMs.
Details
Motivation: To analyze the effectiveness of publicly available open-source LLMs that can run on consumer hardware for multi-label intent classification in dialogue systems, comparing them with traditional supervised approaches.
Method: Used MultiWOZ 2.1 dataset with few-shot setup (20 examples per prompt). Evaluated three LLMs (LLama2-7B, Mistral-7B, Yi-6B) and compared with BERT-based supervised classifier. Measured accuracy, precision, recall, F1 scores, inference time, and VRAM requirements.
Result: Mistral-7B outperformed other LLMs on 11 out of 14 intent classes with a weighted F1 score of 0.50, showing lower Hamming Loss and higher Jaccard Similarity. However, the BERT-based supervised classifier achieved superior performance compared to the best LLM.
Conclusion: While open-source LLMs show promise for multi-intent detection in few-shot settings, traditional supervised learning with smaller models like BERT still provides better performance for multi-label intent classification tasks in dialogue systems.
Abstract: In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run in consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. The Mistral-7B-v0.1 outperforms two other generative models on 11 intent classes out of 14 in terms of F-Score, with a weighted average of 0.50. It also has relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find the BERT-based supervised classifier to have superior performance compared to the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.
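The multi-label metrics reported here are standard. A toy example showing how weighted F1, Hamming loss, and Jaccard similarity are computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, jaccard_score

# Toy multi-label predictions over 4 intents for 3 utterances.
y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]])

print("weighted F1:",
      f1_score(y_true, y_pred, average="weighted", zero_division=0))
print("Hamming loss:", hamming_loss(y_true, y_pred))  # lower is better
print("Jaccard:",
      jaccard_score(y_true, y_pred, average="samples", zero_division=0))
```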
[35] Linguistic trajectories of bipolar disorder on social media
Laurin Plank, Armin Zlomuzica
Main category: cs.CL
TL;DR: Study uses social media language analysis to track bipolar disorder diagnosis timing and shows pervasive linguistic changes reflecting mood disturbances, comorbidities, and recurring seasonal patterns over decades.
Details
Motivation: Clinical assessments for bipolar disorder are limited in scale, while social media language analysis offers high temporal resolution and longitudinal scope for studying mental health markers.
Method: Introduced a method to determine timing of users' diagnoses and analyzed language trajectories from 3 years before to 21 years after bipolar disorder diagnosis, compared with unipolar depression and non-affected users.
Result: Found pervasive linguistic alterations reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, and disorganized thought. Observed recurring mood-related language changes with 12-month periodicity suggestive of seasonal episodes, with increased periodicity in female users.
Conclusion: Provides evidence for language alterations in both acute and chronic phases of bipolar disorder, validating social media as a scalable tool for mental health monitoring.
Abstract: Language provides valuable markers of affective disorders such as bipolar disorder (BD), yet clinical assessments remain limited in scale. In response, analyses of social media (SM) language have gained prominence due to their high temporal resolution and longitudinal scope. Here, we introduce a method to determine the timing of users' diagnoses and apply it to study language trajectories from 3 years before to 21 years after BD diagnosis - contrasted with users reporting unipolar depression (UD) and non-affected users (HC). We show that BD diagnosis is accompanied by pervasive linguistic alterations reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, medical comorbidities, unusual thought content, and disorganized thought. We further observe recurring mood-related language changes across two decades after the diagnosis, with a pronounced 12-month periodicity suggestive of seasonal mood episodes. Finally, trend-level evidence suggests an increased periodicity in users estimated to be female. In sum, our findings provide evidence for language alterations in the acute and chronic phase of BD. This validates and extends recent efforts leveraging SM for scalable monitoring of mental health.
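A 12-month periodicity of this kind can be checked with a simple periodogram. A generic sketch (not the paper's analysis) that recovers the dominant period of a monthly language score:

```python
import numpy as np

def dominant_period(monthly_scores):
    """Dominant period (in months) of a regularly sampled monthly series,
    found via the FFT power spectrum."""
    x = np.asarray(monthly_scores, dtype=float)
    x = x - x.mean()                      # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1)  # cycles per month
    k = power[1:].argmax() + 1            # skip the zero frequency
    return 1 / freqs[k]

# A series with a yearly cycle recovers a period of ~12 months.
t = np.arange(120)
print(dominant_period(np.sin(2 * np.pi * t / 12) + 0.1 * np.random.randn(120)))
```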
[36] !MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Mohamed Basem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
Main category: cs.CL
TL;DR: MSA’s winning system for Arabic readability assessment uses a confidence-weighted ensemble of four transformer models with diverse loss functions, enhanced by data augmentation and post-processing to achieve 87.5% QWK.
Details
Motivation: To address the challenges of fine-grained Arabic readability assessment, particularly severe class imbalance and data scarcity in the BAREC 2025 Shared Task.
Method: Ensemble of four transformer models (AraBERTv2, AraELECTRA, MARBERT, CAMeLBERT) with distinct loss functions, weighted training, advanced preprocessing, SAMER corpus relabeling, synthetic data generation via Gemini 2.5 Flash (adding 10,000 rare-level samples), and targeted post-processing.
Result: Achieved first place in all six tracks with 87.5% QWK at sentence level and 87.4% QWK at document level, demonstrating a 6.3% QWK gain from post-processing.
Conclusion: The system demonstrates the effectiveness of model diversity, confidence-informed fusion, and intelligent data augmentation for robust Arabic readability prediction.
Abstract: We present MSA's winning system for the BAREC 2025 Shared Task on fine-grained Arabic readability assessment, achieving first place in six of six tracks. Our approach is a confidence-weighted ensemble of four complementary transformer models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT), each fine-tuned with distinct loss functions to capture diverse readability signals. To tackle severe class imbalance and data scarcity, we applied weighted training, advanced preprocessing, SAMER corpus relabeling with our strongest model, and synthetic data generation via Gemini 2.5 Flash, adding about 10,000 rare-level samples. A targeted post-processing step corrected prediction distribution skew, delivering a 6.3 percent Quadratic Weighted Kappa (QWK) gain. Our system reached 87.5 percent QWK at the sentence level and 87.4 percent at the document level, demonstrating the power of model and loss diversity, confidence-informed fusion, and intelligent augmentation for robust Arabic readability prediction.
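One simple reading of "confidence-weighted ensemble" is to weight each model's class probabilities by its own confidence. A sketch under that assumption; the system's actual fusion rule may differ.

```python
import numpy as np

def confidence_weighted_ensemble(prob_list):
    """Fuse per-model class probabilities, weighting each model per sample
    by its own confidence (its max class probability)."""
    probs = np.stack(prob_list)               # (n_models, n_samples, n_classes)
    conf = probs.max(axis=-1, keepdims=True)  # per-model, per-sample confidence
    fused = (conf * probs).sum(axis=0) / conf.sum(axis=0)
    return fused.argmax(axis=-1)

# Four models (AraBERTv2, AraELECTRA, MARBERT, CAMeLBERT) over 2 sentences
# and a toy 5-level readability scale:
rng = np.random.default_rng(0)
raw = [rng.random((2, 5)) for _ in range(4)]
preds = confidence_weighted_ensemble([p / p.sum(-1, keepdims=True) for p in raw])
```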
[37] Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models
Dongmin Choi, Woojung Song, Jongwook Han, Eun-Ju Lee, Yohan Jo
Main category: cs.CL
TL;DR: Established psychometric questionnaires yield different personality profiles for LLMs compared to ecologically valid questionnaires, showing they don’t adequately reflect real-world LLM behavior and should be used with caution.
Details
Motivation: Concerns about applying human-designed psychological questionnaires to LLMs, particularly their lack of ecological validity in reflecting real-world contexts where LLMs generate responses to user queries.
Method: Comprehensive comparative analysis between established psychometric questionnaires (BFI, PVQ) and ecologically valid questionnaires for measuring LLM personality traits and values.
Result: Established questionnaires: (1) yield substantially different LLM profiles than ecological ones, (2) have insufficient items for stable measurement, (3) create misleading impressions of stable constructs in LLMs, and (4) produce exaggerated profiles for persona-prompted LLMs.
Conclusion: Researchers should caution against using established psychological questionnaires for LLMs due to their lack of ecological validity and misleading results.
Abstract: Researchers have applied established psychometric questionnaires (e.g., BFI, PVQ) to measure the personality traits and values reflected in the responses of Large Language Models (LLMs). However, concerns have been raised about applying these human-designed questionnaires to LLMs. One such concern is their lack of ecological validity–the extent to which survey questions adequately reflect and resemble real-world contexts in which LLMs generate texts in response to user queries. However, it remains unclear how established questionnaires and ecologically valid questionnaires differ in their outcomes, and what insights these differences may provide. In this paper, we conduct a comprehensive comparative analysis of the two types of questionnaires. Our analysis reveals that established questionnaires (1) yield substantially different profiles of LLMs from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) suffer from insufficient items for stable measurement, (3) create misleading impressions that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs. Overall, our work cautions against the use of established psychological questionnaires for LLMs. Our code will be released upon publication.
[38] Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery
Mustapha Adamu, Qi Zhang, Huitong Pan, Longin Jan Latecki, Eduard C. Dragut
Main category: cs.CL
TL;DR: A climate science Knowledge Graph that enables semantic queries for precise connections between models, datasets, regions, and variables, integrated with LLMs for improved climate question answering.
Details
Motivation: The growing complexity and volume of climate science literature make it difficult for researchers to find relevant information across different models, datasets, regions, and variables.
Method: Built a domain-specific Knowledge Graph from climate publications and scientific texts, supporting structured semantic queries using Cypher, and integrated with large language models in RAG systems.
Result: The KG enables precise discovery of connections such as which models have been validated in specific regions or which datasets are commonly used with certain teleconnection patterns.
Conclusion: This work demonstrates real-world value for climate researchers and model developers by providing accurate, contextual scientific information through semantic knowledge discovery.
Abstract: The growing complexity and volume of climate science literature make it increasingly difficult for researchers to find relevant information across models, datasets, regions, and variables. This paper introduces a domain-specific Knowledge Graph (KG) built from climate publications and broader scientific texts, aimed at improving how climate knowledge is accessed and used. Unlike keyword based search, our KG supports structured, semantic queries that help researchers discover precise connections such as which models have been validated in specific regions or which datasets are commonly used with certain teleconnection patterns. We demonstrate how the KG answers such questions using Cypher queries, and outline its integration with large language models in RAG systems to improve transparency and reliability in climate-related question answering. This work moves beyond KG construction to show its real world value for climate researchers, model developers, and others who rely on accurate, contextual scientific information.
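To show what such a semantic query might look like in practice, here is a sketch using the neo4j Python driver; the node labels, relationship type, and property names form a hypothetical schema, not the paper's actual KG.

```python
from neo4j import GraphDatabase

# Hypothetical schema: (:Model)-[:VALIDATED_IN]->(:Region). Labels, the
# relationship type, and property names are illustrative, not the paper's.
QUERY = """
MATCH (m:Model)-[:VALIDATED_IN]->(r:Region {name: $region})
RETURN m.name AS model
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    models = [record["model"] for record in session.run(QUERY, region="Sahel")]
driver.close()
print(models)  # e.g., climate models validated in the queried region
```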
[39] Arabic Large Language Models for Medical Text Generation
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Ammar Mohammed
Main category: cs.CL
TL;DR: Fine-tuned LLMs for Arabic medical text generation to provide accurate medical advice, with Mistral-7B performing best (68.5% F1-score).
Details
Motivation: Address limitations in existing hospital management systems that lack accurate real-time medical advice for irregular inputs and underrepresented languages like Arabic.
Method: Collected real-world medical conversations from social media, preprocessed for Arabic dialects, and fine-tuned Mistral-7B, LLaMA-2-7B, and GPT-2 Medium models.
Result: Fine-tuned Mistral-7B achieved best performance with 68.5% precision, 69.08% recall, and 68.5% F1-score on BERT Score metrics.
Conclusion: Generative AI shows strong potential for advancing hospital management systems, offering scalable solutions for diverse linguistic and cultural healthcare environments.
Abstract: Efficient hospital management systems (HMS) are critical worldwide to address challenges such as overcrowding, limited resources, and poor availability of urgent health care. Existing methods often lack the ability to provide accurate, real-time medical advice, particularly for irregular inputs and underrepresented languages. To overcome these limitations, this study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation. The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input. The research methodology required the collection of a unique dataset from social media platforms, capturing real-world medical conversations between patients and doctors. The dataset, which includes patient complaints together with medical advice, was properly cleaned and preprocessed to account for multiple Arabic dialects. Fine-tuning state-of-the-art generative models, such as Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2 Medium, optimized the system’s ability to generate reliable medical text. Results from evaluations indicate that the fine-tuned Mistral-7B model outperformed the other models, achieving average BERT (Bidirectional Encoder Representations from Transformers) Score values in precision, recall, and F1-scores of 68.5%, 69.08%, and 68.5%, respectively. Comparative benchmarking and qualitative assessments validate the system’s ability to produce coherent and relevant medical replies to informal input. This study highlights the potential of generative artificial intelligence (AI) in advancing HMS, offering a scalable and adaptable solution for global healthcare challenges, especially in linguistically and culturally diverse environments.
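The reported metrics are BERTScore precision/recall/F1. A minimal sketch of how such an evaluation runs with the bert-score package, using placeholder strings instead of real model outputs:

```python
from bert_score import score  # pip install bert-score

candidates = ["generated Arabic medical advice (placeholder)"]
references = ["reference doctor reply (placeholder)"]

# A multilingual encoder is selected automatically from the `lang` argument.
P, R, F1 = score(candidates, references, lang="ar")
print(f"precision={P.mean().item():.3f} "
      f"recall={R.mean().item():.3f} f1={F1.mean().item():.3f}")
```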
[40] Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
Main category: cs.CL
TL;DR: Synthetic data augmentation using ChatGPT-4o and Gemini 2.5 Pro expanded Arabic medical chatbot training data from 20K to 100K records, improving LLM performance with ChatGPT-4o data showing superior results.
Details
Motivation: Addressing the scarcity of large-scale, high-quality annotated Arabic medical datasets that constrain medical chatbot development and limit model scalability and generalization.
Method: Generated 80,000 synthetic question-answer pairs using ChatGPT-4o and Gemini 2.5 Pro, semantically filtered and manually validated them, then fine-tuned five LLMs including Mistral-7B and AraGPT2, with ablation studies comparing synthetic data sources.
Result: ChatGPT-4o generated data consistently led to higher F1-scores and fewer hallucinations across all models compared to Gemini-generated data.
Conclusion: Synthetic augmentation is a viable practical solution for enhancing domain-specific language models in low-resource medical NLP, enabling more inclusive, scalable, and accurate Arabic healthcare chatbot systems.
Abstract: The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using advanced generative AI systems (ChatGPT-4o and Gemini 2.5 Pro), we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of synthetic sources, we conducted an ablation study comparing ChatGPT-4o and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.
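Semantic filtering of synthetic pairs is typically done with embedding similarity. A sketch using sentence-transformers, where the model choice and thresholds are illustrative assumptions; the paper does not specify its filter at this level of detail.

```python
from sentence_transformers import SentenceTransformer, util

def filter_synthetic(real_texts, synthetic_texts, lo=0.55, hi=0.95):
    """Keep synthetic QA texts that are on-topic (similar enough to real
    data) but not near-duplicates of it."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    real_emb = model.encode(real_texts, convert_to_tensor=True)
    syn_emb = model.encode(synthetic_texts, convert_to_tensor=True)
    # Each synthetic sample's closest real sample decides its fate.
    sims = util.cos_sim(syn_emb, real_emb).max(dim=1).values
    return [s for s, sim in zip(synthetic_texts, sims)
            if lo <= float(sim) <= hi]
```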
[41] Population-Aligned Persona Generation for LLM-based Social Simulation
Zhengyu Hu, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Jianxun Lian, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
Main category: cs.CL
TL;DR: A systematic framework for generating high-quality, population-aligned persona sets for LLM-based social simulations, addressing bias issues through narrative persona generation, quality filtering, and importance sampling.
Details
Motivation: Existing LLM-based social simulation studies focus on agent frameworks but overlook persona generation complexities and biases from unrepresentative persona sets, limiting authentic representation of real-world population diversity.
Method: Leverage LLMs to generate narrative personas from social media data, apply quality assessment filtering, use importance sampling for global alignment with psychometric distributions (e.g., Big Five traits), and include a task-specific module for subpopulation adaptation.
Result: Extensive experiments show the method significantly reduces population-level bias and enables accurate, flexible social simulation for various research and policy applications.
Conclusion: The proposed framework provides a systematic approach to synthesizing high-fidelity persona sets that authentically represent real-world population diversity, addressing critical bias issues in LLM-driven social simulations.
Abstract: Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.
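The importance-sampling step can be sketched directly: weight each generated persona by the ratio of a target (reference) density to the proposal density of the generated pool, then resample. Modeling both densities as Gaussians over Big Five scores is an illustrative assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def align_personas(trait_vectors, target_mean, target_cov, n_out, seed=0):
    """Importance-resample personas toward a reference Big Five distribution.

    `trait_vectors`: (n, 5) trait scores of the generated persona pool.
    """
    target = multivariate_normal(target_mean, target_cov)
    # Proposal density estimated from the generated pool itself.
    proposal = multivariate_normal(trait_vectors.mean(0),
                                   np.cov(trait_vectors.T))
    w = target.pdf(trait_vectors) / (proposal.pdf(trait_vectors) + 1e-12)
    w /= w.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(trait_vectors), size=n_out, p=w)  # persona indices
```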
[42] Towards Reliable and Interpretable Document Question Answering via VLMs
Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
Main category: cs.CL
TL;DR: DocExplainerV0 is a plug-and-play bounding-box module that decouples answer generation from spatial localization to improve document answer localization in Vision-Language Models.
Details
Motivation: Accurately localizing answers within documents remains a major challenge for VLMs, limiting interpretability and real-world applicability despite their strong text extraction capabilities.
Method: Introduces DocExplainerV0, a plug-and-play bounding-box prediction module that works with existing VLMs without requiring fine-tuning, decoupling answer generation from spatial localization.
Result: Systematic evaluation reveals a gap between textual accuracy and spatial grounding, showing correct answers often lack reliable localization. The framework highlights these shortcomings.
Conclusion: Establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs, addressing the localization challenge in document understanding.
Abstract: Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs.
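DocExplainerV0 itself is a learned bounding-box predictor; as a point of contrast, here is a simple heuristic baseline that grounds an answer by fuzzy-matching it against OCR tokens and returning the union of the matched boxes. This is purely illustrative, not the paper's module.

```python
from difflib import SequenceMatcher

def locate_answer(answer, ocr_tokens):
    """Return the union of the OCR word boxes whose window best matches
    the answer string.

    `ocr_tokens` is a list of (word, (x0, y0, x1, y1)) pairs from any OCR
    engine; box coordinates are in page space.
    """
    words = [w for w, _ in ocr_tokens]
    n = min(max(1, len(answer.split())), len(words))
    best, span = 0.0, (0, n)
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        sim = SequenceMatcher(None, answer.lower(), window.lower()).ratio()
        if sim > best:
            best, span = sim, (i, i + n)
    boxes = [box for _, box in ocr_tokens[span[0]:span[1]]]
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))
```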
[43] Benchmark of stylistic variation in LLM-generated texts
Jiří Milička, Anna Marklová, Václav Cvrček
Main category: cs.CL
TL;DR: This study compares human-written texts with AI-generated texts using Biber’s multidimensional analysis to identify systematic differences, creating a benchmark for evaluating LLM performance across English and Czech.
Details
Motivation: To understand how large language models differ from human writers in register variation and create an interpretable benchmark for comparing different LLM models.
Method: Applied Biber's multidimensional analysis (MDA) to human-written texts and AI-generated counterparts using AI-Brown corpus (comparable to BE-21) for English and AI-Koditex for Czech. Analyzed 16 frontier LLM models in various settings, comparing base models vs instruction-tuned models.
Result: Identified specific dimensions of variation where LLMs differ most significantly and systematically from human writers. Created a benchmark for model comparison and ranking.
Conclusion: The study provides a systematic framework for evaluating LLM performance against human writing standards, revealing consistent patterns of difference that can guide future model development and assessment.
Abstract: This study investigates the register variation in texts written by humans and comparable texts produced by large language models (LLMs). Biber’s multidimensional analysis (MDA) is applied to a sample of human-written texts and AI-created texts generated to be their counterparts to find the dimensions of variation in which LLMs differ most significantly and most systematically from humans. As textual material, a new LLM-generated corpus AI-Brown is used, which is comparable to BE-21 (a Brown family corpus representing contemporary British English). Since all languages except English are underrepresented in the training data of frontier LLMs, similar analysis is replicated on Czech using AI-Koditex corpus and Czech multidimensional model. Examined were 16 frontier models in various settings and prompts, with emphasis placed on the difference between base models and instruction-tuned models. Based on this, a benchmark is created through which models can be compared with each other and ranked in interpretable dimensions.
[44] Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations
Leen Almajed, Abeer ALdayel
Main category: cs.CL
TL;DR: LLMs often generate incongruent positivity - overly optimistic or dismissive responses that misfire in emotional support contexts, especially in high-stakes situations like grief and anxiety.
Details
Motivation: To examine how well-intended positivity can backfire in emotionally supportive conversations, particularly in both human and LLM-generated responses, and to understand how this varies across different emotional intensity levels.
Method: Collected real user-assistant dialogues from Reddit across emotional intensities, generated additional LLM responses, categorized conversations into Mild (relationship tension, general advice) and Severe (grief, anxiety) levels, fine-tuned LLMs on datasets with different emotional reactions, and developed a weakly supervised multilabel classifier ensemble using DeBERTa and MentalBERT.
Result: LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. The developed classifier ensemble showed improved detection of incongruent positivity types across different concern levels.
Conclusion: There’s a need to move beyond generic positive responses and develop congruent support measures that balance positive affect with emotional acknowledgment, paving the way for context-aware and trust-preserving online conversation systems.
Abstract: In emotionally supportive conversations, well-intended positivity can sometimes misfire, leading to responses that feel dismissive, minimizing, or unrealistically optimistic. We examine this phenomenon of incongruent positivity as miscalibrated expressions of positive support in both human and LLM generated responses. To this end, we collected real user-assistant dialogues from Reddit across a range of emotional intensities and generated additional responses using large language models for the same context. We categorize these conversations by intensity into two levels: Mild, which covers relationship tension and general advice, and Severe, which covers grief and anxiety conversations. This level of categorization enables a comparative analysis of how supportive responses vary across lower and higher stakes contexts. Our analysis reveals that LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. To further study the underlying dimensions of this phenomenon, we finetune LLMs on datasets with strong and weak emotional reactions. Moreover, we developed a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that shows improved detection of incongruent positivity types across two sorts of concerns (Mild and Severe). Our findings shed light on the need to move beyond merely generating generic positive responses and instead study the congruent support measures to balance positive affect with emotional acknowledgment. This approach offers insights into aligning large language models with affective expectations in the online supportive dialogue, paving the way toward context-aware and trust preserving online conversation systems.
[45] Beyond Token Limits: Assessing Language Model Performance on Long Text Classification
Miklós Sebők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune, Philippe Roussille
Main category: cs.CL
TL;DR: Analysis of long-text classification performance for legal documents using various LLMs, showing open models outperform specialized long-context models and GPT variants.
Details
Motivation: Address limitations of standard LLMs (like BERT/RoBERTa) that can't process long legal documents (hundreds of pages) due to token length constraints, particularly for policy topic classification tasks.
Method: Experiments with 5 languages using XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models for multiclass classification of legal documents according to the Comparative Agendas Project's 21 policy topic labels.
Result: No advantage found for Longformer (specifically designed for long inputs). Open models outperformed GPT variants. Performance influenced by support and substance overlaps between policy categories.
Conclusion: Specialized long-context models don’t necessarily outperform standard open models for long-text classification tasks, with category relationships being more important than model architecture for this domain.
Abstract: The most widely used large language models in the social sciences (such as BERT, and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable for processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.
[46] SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning
Shengqiang Fu
Main category: cs.CL
TL;DR: SI-FACT is a self-improving framework that uses contrastive learning to enhance LLM faithfulness by automatically generating training data and reducing reliance on internal knowledge when external context is provided.
Details
Motivation: Large Language Models often generate unfaithful responses in knowledge-intensive tasks due to knowledge conflict - preferring internal parametric knowledge over provided context, which undermines trustworthiness.
Method: Self-Improving Faithfulness-Aware Contrastive Tuning (SI-FACT) uses a self-instruct mechanism to automatically generate structured contrastive learning data (anchor, positive, negative samples) and applies contrastive learning to train models to distinguish faithful from unfaithful responses (see the loss sketch after the abstract).
Result: Experiments on the ECARE KRE and COSE KRE benchmarks show that SI-FACT, based on Llama3-8B-Instruct, improves the Contextual Recall Rate by 6.2% over the best baseline and significantly reduces dependence on internal memory.
Conclusion: SI FACT provides strong effectiveness and high data efficiency in enhancing contextual faithfulness of LLMs, offering a practical pathway toward building more proactive and trustworthy language models.
Abstract: Large Language Models often generate unfaithful responses in knowledge-intensive tasks due to knowledge conflict, that is, a preference for relying on internal parametric knowledge rather than the provided context. To address this issue, we propose a novel self-improving framework, Self-Improving Faithfulness-Aware Contrastive Tuning (SI-FACT). The framework uses a self-instruct mechanism that allows the base LLM to automatically generate high-quality, structured contrastive learning data, including anchor samples, semantically equivalent positive samples, and negative samples simulating unfaithful scenarios. This approach significantly reduces the cost of manual annotation. Subsequently, contrastive learning is applied to train the model, enabling it to pull faithful responses closer and push unfaithful responses farther apart in the representation space. Experiments on the knowledge conflict evaluation benchmarks ECARE KRE and COSE KRE show that the SI-FACT model based on Llama3-8B-Instruct improves the Contextual Recall Rate by 6.2% over the best baseline method, while significantly reducing dependence on internal memory. The results indicate that SI-FACT provides strong effectiveness and high data efficiency in enhancing the contextual faithfulness of LLMs, offering a practical pathway toward building more proactive and trustworthy language models.
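The contrastive stage above pulls faithful responses toward the anchor and pushes unfaithful ones away in representation space. A minimal sketch of one standard way to do this, an InfoNCE-style loss over pooled response embeddings; the temperature and normalization are assumptions here, not necessarily SI-FACT's exact objective.

```python
# Minimal sketch of a contrastive objective over (anchor, positive, negative)
# response embeddings. InfoNCE with a temperature is an assumption; SI-FACT's
# exact loss and representation choice may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: (d,) embeddings; negatives: (k, d) unfaithful samples."""
    anchor = F.normalize(anchor, dim=-1)
    pos_sim = anchor @ F.normalize(positive, dim=-1) / temperature
    neg_sim = F.normalize(negatives, dim=-1) @ anchor / temperature
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])  # positive at index 0
    # Cross-entropy against index 0 pulls the faithful response closer and
    # pushes the unfaithful ones farther apart.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

d, k = 768, 4
loss = contrastive_loss(torch.randn(d), torch.randn(d), torch.randn(k, d))
print(loss.item())
```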
[47] Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, Zhiliang Wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
Main category: cs.CL
TL;DR: DERN is a retraining-free framework that prunes redundant experts and recombines neurons at the segment level to reduce memory usage in SMoE models while improving performance.
Details
Motivation: SMoE architectures require loading all expert parameters despite only activating a few, leading to high memory usage and deployment challenges. Previous methods focused on expert-level operations but neglected neuron-level structure.
Method: Three-step approach: 1) Prune redundant experts using router statistics, 2) Decompose experts into neuron-level segments and assign to compatible retained experts, 3) Merge segments within each retained expert to create compact representations (see the assignment sketch after the abstract).
Result: Improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, reduces number of experts and memory usage without extra training.
Conclusion: DERN provides an effective task-agnostic and retraining-free solution for expert pruning and reconstruction, making SMoE LLMs more practical for deployment while maintaining or improving performance.
Abstract: Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
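Step 2 of DERN assigns each neuron-level segment of a pruned expert to its most compatible retained expert. A minimal sketch under the assumption that compatibility is cosine similarity between mean weight directions; the paper's actual criterion may differ.

```python
# Minimal sketch of neuron-segment reassignment (DERN step 2). Cosine
# similarity against each retained expert's mean weight direction is an
# assumed compatibility score, not necessarily the paper's.
import torch
import torch.nn.functional as F

def assign_segments(pruned_experts, retained_experts, seg_size=64):
    """Each expert is an (n_neurons, d) weight matrix. Returns, for every
    segment of every pruned expert, the index of its best retained expert."""
    centroids = torch.stack([F.normalize(e.mean(dim=0), dim=-1)
                             for e in retained_experts])        # (R, d)
    assignments = []
    for w in pruned_experts:
        for seg in w.split(seg_size, dim=0):                    # neuron segments
            seg_dir = F.normalize(seg.mean(dim=0), dim=-1)      # (d,)
            assignments.append(int((centroids @ seg_dir).argmax()))
    return assignments

pruned = [torch.randn(256, 512) for _ in range(2)]
retained = [torch.randn(256, 512) for _ in range(6)]
print(assign_segments(pruned, retained)[:8])
```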
[48] Is In-Context Learning Learning?
Adrian de Wynter
Main category: cs.CL
TL;DR: ICL enables autoregressive models to solve tasks via next-token prediction without training, but analysis shows it constitutes learning with limitations in generalization to unseen tasks.
Details
Motivation: To investigate whether in-context learning (ICL) truly constitutes learning or is merely deduction, and to characterize its capabilities and limitations through empirical analysis.
Method: Conducted large-scale analysis of ICL by ablating and accounting for memorization, pretraining, distributional shifts, and prompting style/phrasing across varied tasks.
Result: ICL is an effective learning paradigm but limited in generalization to unseen tasks. With numerous exemplars, accuracy becomes insensitive to exemplar distribution, model, prompt style, and linguistic features, instead deducing patterns from prompt regularities.
Conclusion: Autoregression’s ad-hoc encoding is not a robust mechanism, suggesting limited all-purpose generalizability despite ICL’s effectiveness as a learning approach.
Abstract: In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism, which suggests limited all-purpose generalisability.
[49] Long Context Automated Essay Scoring with Language Models
Christopher Ormerod, Gitit Kehat
Main category: cs.CL
TL;DR: This paper evaluates transformer models with architectural modifications to handle long essays for automated scoring, addressing the limitation of fixed-length inputs in standard transformers.
Details
Motivation: Standard transformer models have fixed maximum text length constraints, forcing truncation of longer student essays which undermines valid assessment of organizational elements in automated essay scoring.
Method: Evaluated several modified transformer architectures (XLNet, Longformer, ModernBERT, Mamba, Llama) fine-tuned on the Kaggle ASAP 2.0 dataset to overcome length limitations (see the fine-tuning sketch after the abstract).
Result: The study assessed the performance of these length-adapted models for automated essay scoring, though specific results are not provided in the abstract.
Conclusion: Architectural modifications to transformer models can potentially address the length constraints that compromise validity in automated essay scoring systems.
Abstract: Transformer-based language models are architecturally constrained to process text of a fixed maximum length. Essays written by higher-grade students frequently exceed the maximum allowed length for many popular open-source models. A common approach to addressing this issue when using these models for Automated Essay Scoring is to truncate the input text. This raises serious validity concerns as it undermines the model’s ability to fully capture and evaluate organizational elements of the scoring rubric, which requires long contexts to assess. In this study, we evaluate several models that incorporate architectural modifications of the standard transformer architecture to overcome these length limitations using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.
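For concreteness, here is a minimal sketch of preparing one of the listed long-context models for essay scoring. The Longformer checkpoint name is real; treating the score as a single regression output (num_labels=1) is an assumption about the setup, not the paper's confirmed configuration.

```python
# Minimal sketch: load a long-context encoder with a single-output head for
# essay-score regression. The head is untrained here and would be fine-tuned
# on ASAP 2.0; num_labels=1 (regression) is an assumed framing.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "allenai/longformer-base-4096"     # accepts up to 4096 tokens
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

essay = "..."  # a full student essay; no truncation to 512 tokens required
inputs = tokenizer(essay, return_tensors="pt",
                   truncation=True, max_length=4096)
score = model(**inputs).logits.item()     # meaningful only after fine-tuning
print(score)
```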
[50] RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment
Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish
Main category: cs.CL
TL;DR: A cloud-edge collaborative architecture with GuideLLM, SolverLLM, and JudgeLLM agents for structured problem-solving, achieving 76.84% accuracy on the new RefactorCoderQA benchmark.
Details
Motivation: To overcome limitations of existing benchmarks and optimize LLMs' reasoning capabilities through a structured multi-agent framework that leverages both edge and cloud computing.
Method: Proposed a three-component architecture: GuideLLM (edge-based guidance), SolverLLM (cloud-based code generation), and JudgeLLM (automated evaluation). Created RefactorCoderQA benchmark with authentic Stack Overflow challenges across multiple domains (see the orchestration sketch after the abstract).
Result: RefactorCoder-MoE achieved state-of-the-art 76.84% accuracy, outperforming leading open-source and commercial baselines. Human evaluations confirmed solution quality, and system metrics showed good throughput/latency performance.
Conclusion: The cloud-edge collaborative architecture with specialized agents effectively enhances LLM reasoning and problem-solving capabilities, demonstrating superior performance across diverse coding domains with practical relevance.
Abstract: To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.
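The three agents above chain naturally into a guide-solve-judge loop. A minimal orchestration skeleton, where `edge_llm` and `cloud_llm` are hypothetical completion callables; the prompts and retry policy are illustrative, not the paper's.

```python
# Minimal orchestration skeleton for the GuideLLM -> SolverLLM -> JudgeLLM
# pipeline. `edge_llm` / `cloud_llm` are hypothetical callables (str -> str);
# prompts and the retry policy are illustrative.
from typing import Callable

def solve_with_agents(question: str,
                      edge_llm: Callable[[str], str],
                      cloud_llm: Callable[[str], str],
                      max_retries: int = 2) -> str:
    guidance = edge_llm(f"Outline a method to solve:\n{question}")
    for _ in range(max_retries + 1):
        solution = cloud_llm(
            f"Question:\n{question}\n\nGuidance:\n{guidance}\n\n"
            "Write a complete code solution.")
        verdict = cloud_llm(
            "Judge this solution for correctness (answer PASS or FAIL "
            f"with a reason):\n{solution}")
        if verdict.strip().upper().startswith("PASS"):
            return solution
        guidance += f"\nPrevious attempt failed: {verdict}"
    return solution  # best effort after retries

# Usage with stub models:
print(solve_with_agents(
    "Reverse a linked list.",
    edge_llm=lambda p: "Iterate, repointing next pointers.",
    cloud_llm=lambda p: "PASS: def reverse(head): ..."))
```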
[51] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
Main category: cs.CL
TL;DR: DeepDive is a system that enhances LLMs as deep search agents through automated question synthesis from knowledge graphs and multi-turn reinforcement learning, achieving state-of-the-art performance on browsing benchmarks.
Details
Motivation: Open LLMs perform poorly as deep search agents due to limited long-horizon reasoning with browsing tools and lack of sufficiently difficult supervised training data.
Method: Automatically synthesizes complex questions from open knowledge graphs and applies end-to-end multi-turn reinforcement learning to enhance LLMs' long-horizon reasoning capabilities for deep search (see the synthesis sketch after the abstract).
Result: DeepDive-32B achieves new open-source competitive results on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. Multi-turn RL training significantly improves deep search ability across multiple benchmarks.
Conclusion: The approach enables test-time scaling of tool calls and parallel sampling, demonstrating that automated question synthesis combined with multi-turn RL effectively enhances LLMs’ capabilities as deep search agents.
Abstract: Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs’ long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
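The first ingredient, synthesizing hard questions from a knowledge graph, can be illustrated with a templated random walk over triples. A minimal sketch on a toy edge list; DeepDive's actual difficulty-raising transformations are more involved than this.

```python
# Minimal sketch of multi-hop question synthesis from a KG edge list.
# Templated phrasing over a random walk is an assumption; DeepDive's real
# pipeline obfuscates and hardens questions far more aggressively.
import random

TRIPLES = [  # toy (head, relation, tail) facts
    ("Ada Lovelace", "collaborated_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Analytical Engine", "anticipated", "modern computers"),
]

def synthesize_question(triples, hops=2, seed=0):
    rng = random.Random(seed)
    by_head = {}
    for h, r, t in triples:
        by_head.setdefault(h, []).append((r, t))
    head = rng.choice([h for h, _, _ in triples])
    path, node = [], head
    for _ in range(hops):
        if node not in by_head:
            break
        r, t = rng.choice(by_head[node])
        path.append((node, r, t))
        node = t
    # Hide the start entity so answering requires following the chain.
    hops_text = ", which ".join(f"{r.replace('_', ' ')} {t}"
                                for _, r, t in path)
    return f"What entity {hops_text}?", head

question, answer = synthesize_question(TRIPLES)
print(question, "->", answer)
```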
[52] WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Akshat Pandey, Karun Kumar, Raphael Tang
Main category: cs.CL
TL;DR: WhisTLE is a text-only adaptation method for pretrained ASR models that uses a variational autoencoder to model encoder outputs from text, enabling domain adaptation without speech data while maintaining original runtime performance.
Details
Motivation: Pretrained ASR models like Whisper need domain adaptation for unseen vocabulary and parlance, but collecting speech data is often impractical in real-world settings, necessitating text-only adaptation approaches.
Method: Proposes WhisTLE, which trains a VAE to model encoder outputs from text, fine-tunes the decoder using the learned text-to-latent encoder, and optionally combines this with TTS adaptation. The original encoder is restored at inference (see the VAE sketch after the abstract).
Result: Across 4 out-of-domain datasets and 4 ASR models, WhisTLE with TTS reduces WER by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
Conclusion: WhisTLE provides effective text-only domain adaptation for ASR models, significantly improving performance on out-of-domain data without requiring speech collection or adding inference overhead.
Abstract: Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
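A minimal sketch of the text-to-latent idea: a small VAE maps text embeddings to a distribution over pseudo encoder states, trained with reconstruction plus KL terms against paired true encoder outputs. Dimensions, architecture, and the plain MSE reconstruction are placeholder assumptions, not WhisTLE's actual design.

```python
# Minimal sketch of a text-to-latent VAE: map text embeddings to a
# distribution over pseudo ASR-encoder states. Sizes, the single-layer
# modules, and the 0.1 KL weight are all placeholder assumptions.
import torch
import torch.nn as nn

class TextToLatentVAE(nn.Module):
    def __init__(self, text_dim=512, enc_dim=768):
        super().__init__()
        self.proj = nn.Linear(text_dim, 2 * enc_dim)   # -> (mu, logvar)
        self.decode = nn.Linear(enc_dim, enc_dim)      # latent -> encoder state

    def forward(self, text_emb, target_enc_states):
        mu, logvar = self.proj(text_emb).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decode(z)
        recon_loss = nn.functional.mse_loss(recon, target_enc_states)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon, recon_loss + 0.1 * kl

vae = TextToLatentVAE()
text_emb = torch.randn(4, 100, 512)      # batch of token embeddings
enc_states = torch.randn(4, 100, 768)    # paired true encoder outputs (training)
pseudo_states, loss = vae(text_emb, enc_states)
loss.backward()  # at text-only adaptation time, pseudo_states feed the decoder
```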
[53] Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models
Tyler Bell, Avinash Mudireddy, Ivan Johnson-Eversoll, Soura Dasgupta, Raghu Mudumbai
Main category: cs.CL
TL;DR: Language models’ long synthetic texts must converge to a typical set defined by average entropy, and this grammatically correct typical set is vanishingly small, constraining model outputs.
Details
Motivation: To understand the fundamental limitations and predictable patterns in language model outputs, particularly for long synthetic texts, and to establish theoretical foundations for detecting synthetic content.
Method: Proved asymptotic convergence of logarithmic perplexity to average entropy, defined a typical set for synthetic texts, refined it to include only grammatically correct texts, and analyzed its size relative to all possible grammatically correct texts (the convergence statement is restated after the abstract).
Result: Long texts generated by language models must belong to a typical set that converges to average entropy, and this grammatically correct typical set is vanishingly small compared to all possible grammatically correct texts.
Conclusion: Language models are strongly constrained in their output range, making their behavior predictable, which has practical applications for synthetic text detection and membership inference without requiring simplifying assumptions.
Abstract: We prove a new asymptotic un-equipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to. We refine the concept of "typical set" to include only grammatically correct texts. We then show that this refined typical set is a vanishingly small subset of all possible grammatically correct texts for a very general definition of grammar. This means that language models are strongly constrained in the range of their possible behaviors and outputs. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations. We discuss possible applications of the typical set concept to problems such as detecting synthetic texts and membership inference in training datasets.
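The central claim can be stated compactly. A sketch of the convergence in notation assumed here (not necessarily the paper's): for tokens x_1, ..., x_n sampled from the model's own conditional distributions,

```latex
% Sketch of the claimed convergence, in notation assumed here: the per-token
% log perplexity of model-generated text approaches the average entropy of
% the conditional token distributions it was sampled from.
\[
  \underbrace{-\frac{1}{n}\sum_{t=1}^{n} \log p\!\left(x_t \mid x_{<t}\right)}_{\text{log perplexity}}
  \;-\;
  \underbrace{\frac{1}{n}\sum_{t=1}^{n} H\!\left(p(\cdot \mid x_{<t})\right)}_{\text{average entropy}}
  \;\xrightarrow[n \to \infty]{}\; 0,
\]
```

so long synthetic texts concentrate in the "typical set" of sequences whose log perplexity matches this entropy average.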
[54] UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs
Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji
Main category: cs.CL
TL;DR: UIO-LLMs introduces an unbiased incremental optimization approach for memory-enhanced transformers to handle long contexts efficiently, extending Llama2-7b-chat from 4K to 100K tokens with minimal parameter overhead.
Details
Motivation: Large language models struggle with long texts due to limited context window sizes, creating a need for efficient methods to handle extended contexts without excessive computational costs.
Method: Uses a weights-shared encoder-decoder framework to encapsulate context segments into memories, treats transformers as fully-connected RNNs, and applies Truncated Backpropagation Through Time with innovative incremental optimization techniques for unbiased gradient computation (see the TBPTT sketch after the abstract).
Result: Successfully extends context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters while maintaining nearly linear inference cost as context length increases.
Conclusion: UIO-LLMs provides an effective solution for managing long contexts in LLMs through memory-enhanced transformers with unbiased incremental optimization, achieving significant context extension with minimal parameter overhead.
Abstract: Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a context segment into memories and leverage these memories to predict outputs of the subsequent segment. Subsequently, by treating our memory-enhanced transformers as fully-connected recurrent neural networks (RNNs), we refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative incremental optimization techniques. These techniques not only diminish time complexity but also address the bias in gradient computation through an unbiased optimization process. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters, while keeping the inference cost nearly linear as context length increases.
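The training view treats the memory-enhanced transformer as an RNN over context segments and truncates backpropagation between them. A minimal sketch of that segment loop with memory detachment, which is the core of TBPTT; the modules are toy stand-ins, and the unbiased gradient corrections UIO-LLMs add on top are not reproduced here.

```python
# Minimal sketch of segment-wise training with truncated BPTT: memory is
# detached between segments so gradients do not flow across the full
# context. UIO-LLMs' unbiased correction terms are not reproduced here.
import torch
import torch.nn as nn

d, seg_len, n_segs = 64, 16, 4
encoder = nn.GRUCell(d, d)        # toy stand-in for the memory encoder
decoder = nn.Linear(2 * d, d)     # toy stand-in: predicts next-segment states
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

memory = torch.zeros(1, d)
segments = [torch.randn(1, seg_len, d) for _ in range(n_segs)]
for seg, nxt in zip(segments[:-1], segments[1:]):
    pooled = seg.mean(dim=1)                        # summarize the segment
    memory = encoder(pooled, memory)                # update memory state
    pred = decoder(torch.cat([memory.expand(nxt.size(1), -1),
                              nxt.squeeze(0)], dim=-1))
    loss = nn.functional.mse_loss(pred, nxt.squeeze(0))
    opt.zero_grad(); loss.backward(); opt.step()
    memory = memory.detach()      # truncation: stop gradients across segments
print("done")
```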
[55] Direct Judgement Preference Optimization
Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
Main category: cs.CL
TL;DR: This paper introduces a method to enhance LLM evaluation capabilities using preference optimization with both positive and negative data, achieving state-of-the-art performance on most benchmarks and demonstrating robustness against biases.
Details
Motivation: Auto-evaluation is crucial for assessing response quality and providing feedback for model development. Recent studies have explored using LLMs as generative judges, but there's a need to improve their evaluation capabilities across diverse use cases.
Method: The authors employ three approaches to collect preference pairs for different use cases, using preference optimization to learn from both positive and negative data. This enhances the evaluation capabilities of LLM judges from multiple perspectives (see the preference-loss sketch after the abstract).
Result: The generative judge achieves best performance on 10 out of 13 benchmarks, outperforming strong baselines like GPT-4o and specialized judge models. It robustly counters inherent biases (position and length bias), adapts to any evaluation protocol, and provides helpful language feedback for improving downstream generator models.
Conclusion: The proposed method effectively enhances LLM evaluation capabilities through preference optimization with both positive and negative data, demonstrating superior performance, bias robustness, and practical utility for model improvement.
Abstract: Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models' outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM judges across an array of different use cases. We achieve this by employing three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our comprehensive study over a wide range of benchmarks demonstrates the effectiveness of our method. In particular, our generative judge achieves the best performance on 10 out of 13 benchmarks, outperforming strong baselines like GPT-4o and specialized judge models. Further analyses show that our judge model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
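Learning from positive and negative judgements is naturally expressed as a preference loss. A minimal sketch of the standard DPO objective over chosen vs. rejected judge outputs; whether the paper uses exactly this loss or a variant is not specified in the summary, so treat it as illustrative.

```python
# Minimal sketch of a DPO-style loss over chosen vs. rejected judgements.
# Inputs are assumed to be summed sequence log-likelihoods under the policy
# and a frozen reference model; beta=0.1 is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    # Maximize the margin by which the policy prefers the chosen judgement.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

lp = lambda: torch.randn(8)   # stand-in sequence log-probs (batch of 8)
print(dpo_loss(lp(), lp(), lp(), lp()).item())
```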
[56] Atomic Fact Decomposition Helps Attributed Question Answering
Zhichao Yan, Jiapu Wang, Jiaoyan Chen, Xiaoli Li, Ru Li, Jeff Z. Pan
Main category: cs.CL
TL;DR: Proposes ARE framework that decomposes LLM-generated answers into atomic facts, retrieves evidence for verification, and edits/backtracks facts to improve attribution accuracy in question answering.
Details
Motivation: Existing AQA approaches have limitations: RTR suffers from irrelevant knowledge and outdated information, while post-hoc retrieval struggles with complex long-form answers and precise revision while preserving intent.
Method: The ARE framework decomposes long-form answers into molecular clauses and atomic facts using instruction-tuned LLMs, retrieves evidence for the atomic facts, uses an LLM verifier to decide whether each fact needs expansion or editing, then backtracks edited facts into the original answer with evidence aggregation (see the pipeline skeleton after the abstract).
Result: Superior performance over state-of-the-art methods on multiple datasets, with proposed new metric Attr_p for evaluating evidence attribution precision.
Conclusion: ARE framework effectively addresses limitations of existing AQA approaches by combining atomic fact decomposition with evidence-based verification and editing, achieving better attribution accuracy.
Abstract: Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach, including two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, RTR-based AQA often suffers from irrelevant knowledge and rapidly changing information, even when LLMs are adopted, while post-hoc retrieval-based AQA struggles with comprehending long-form answers with complex logic, and precisely identifying the content needing revision and preserving the original intent. To tackle these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which decomposes the generated long-form answers into molecular clauses and atomic facts by the instruction-tuned LLMs. Notably, the instruction-tuned LLMs are fine-tuned using a well-constructed dataset, generated from large scale Knowledge Graphs (KGs). This process involves extracting one-hop neighbors from a given set of entities and transforming the result into coherent long-form text. Subsequently, ARE leverages a search engine to retrieve evidences related to atomic facts, inputting these evidences into an LLM-based verifier to determine whether the facts require expansion for re-retrieval or editing. Furthermore, the edited facts are backtracked into the original answer, with evidence aggregated based on the relationship between molecular clauses and atomic facts. Extensive evaluations demonstrate the superior performance of our proposed method over state-of-the-art methods on several datasets, with an additionally proposed new metric $Attr_{p}$ for evaluating the precision of evidence attribution.
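The ARE loop (decompose, retrieve, verify, edit, backtrack) reduces to a small control-flow skeleton. In the sketch below, `decompose`, `search`, `verify`, and `edit` are hypothetical callables standing in for the instruction-tuned LLM, the search engine, and the verifier; the string-replacement backtracking is a simplification of the paper's clause-level aggregation.

```python
# Skeleton of the ARE loop with hypothetical callables; control flow only,
# no real models. String replacement stands in for clause-level backtracking.
from typing import Callable, List

def are_pipeline(answer: str,
                 decompose: Callable[[str], List[str]],    # answer -> facts
                 search: Callable[[str], List[str]],       # fact -> evidence
                 verify: Callable[[str, List[str]], str],  # -> ok/edit/expand
                 edit: Callable[[str, List[str]], str]) -> str:
    revised = answer
    for fact in decompose(answer):
        evidence = search(fact)
        verdict = verify(fact, evidence)
        if verdict == "expand":                  # re-retrieve, expanded query
            evidence = search(fact + " background details")
            verdict = verify(fact, evidence)
        if verdict == "edit":
            fixed = edit(fact, evidence)
            revised = revised.replace(fact, fixed)  # backtrack into answer
    return revised

# Usage with stubs:
out = are_pipeline("Paris is in Germany.",
                   decompose=lambda a: ["Paris is in Germany."],
                   search=lambda f: ["Paris is the capital of France."],
                   verify=lambda f, e: "edit",
                   edit=lambda f, e: "Paris is in France.")
print(out)
```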
[57] Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, Roi Reichart
Main category: cs.CL
TL;DR: LLMs can effectively detect label errors in NLP datasets, revealing that many reported model failures are actually due to annotation errors rather than genuine model shortcomings.
Details
Motivation: Traditional expert annotation is costly and doesn't scale, while crowd-sourcing often sacrifices precision. LLMs offer a scalable solution for improving dataset quality by detecting mislabeled examples.
Method: Used an LLM-as-a-judge approach with an ensemble of LLMs to flag potentially mislabeled examples in four factual consistency datasets from the TRUE benchmark and the SummEval dataset. Compared expert, crowd-sourced, and LLM-based annotations for agreement, quality, and efficiency (see the voting sketch after the abstract).
Result: Found substantial number of label errors across datasets. Correcting these errors led to significant upward shift in reported model performance, showing many ‘mistakes’ were actually annotation errors.
Conclusion: LLMs provide effective scalable annotation solution. Label errors significantly impact performance metrics, and methods to mitigate mislabeled data in training can improve model performance.
Abstract: NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. We conduct a case study on four factual consistency datasets from the TRUE benchmark, spanning diverse NLP tasks, and on SummEval, which uses Likert-scale ratings of summary quality across multiple dimensions. We empirically analyze the labeling quality of existing datasets and compare expert, crowd-sourced, and LLM-based annotations in terms of the agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs’ so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve performance.
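The ensemble idea is simple: flag an example when most judges disagree with its gold label. A minimal majority-vote sketch with hypothetical judge callables; the paper's actual aggregation may be more nuanced.

```python
# Minimal sketch of ensemble label-error flagging: an example is suspect
# when most LLM judges disagree with its gold label. `judges` are
# hypothetical callables (text -> predicted label).
from typing import Callable, List

def flag_label_errors(examples, judges: List[Callable[[str], str]],
                      quorum: float = 0.5):
    flagged = []
    for text, gold in examples:
        votes = [judge(text) for judge in judges]
        disagree = sum(v != gold for v in votes) / len(votes)
        if disagree > quorum:
            flagged.append((text, gold, votes))
    return flagged

data = [("The summary contradicts the article.", "consistent")]
judges = [lambda t: "inconsistent", lambda t: "inconsistent",
          lambda t: "consistent"]
print(flag_label_errors(data, judges))   # 2/3 disagree -> flagged
```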
[58] Polish-English medical knowledge transfer: A new benchmark and results
Łukasz Grzybowski, Jakub Pokrywka, Michał Ciesiółka, Jeremi I. Kaczmarek, Marek Kubis
Main category: cs.CL
TL;DR: This study introduces a Polish medical exam benchmark dataset with over 24,000 questions to evaluate LLMs’ performance in non-English medical contexts, revealing significant cross-lingual challenges despite GPT-4o’s near-human performance.
Details
Motivation: Most LLM studies focus on English-language contexts, creating a gap in understanding how these models perform in other languages, particularly in specialized domains like medicine where accurate multilingual capabilities are crucial.
Method: Created a benchmark dataset from Polish medical licensing exams (LEK, LDEK, PES) with over 24,000 questions, including professionally translated English versions. Evaluated various LLMs including general-purpose, domain-specific, and Polish-specific models against human medical students.
Result: GPT-4o achieved near-human performance, but significant challenges were identified in cross-lingual translation and domain-specific understanding. Performance disparities were observed across languages and medical specialties.
Conclusion: The study highlights limitations and ethical considerations for deploying LLMs in clinical practice, emphasizing the need for improved multilingual capabilities and domain-specific understanding in non-English medical contexts.
Abstract: Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies predominantly focus on English-language contexts. This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora, where the English portion was professionally translated by the examination center for foreign candidates. By creating a structured benchmark from these existing exam questions, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students. Our analysis reveals that while models like GPT-4o achieve near-human performance, significant challenges persist in cross-lingual translation and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.
[59] A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls
Sheikh Shafayat, Dongkeun Yoon, Woori Jang, Jiwoo Choi, Alice Oh, Seohyon Jung
Main category: cs.CL
TL;DR: A two-stage pipeline for fine-grained evaluation of English-to-Korean literary machine translation shows better correlation with human judgment than traditional metrics but still falls short of human agreement, particularly in culturally sensitive areas like honorifics.
Details
Motivation: To develop a more sophisticated evaluation framework for literary machine translation that provides fine-grained, interpretable metrics and better captures cultural nuances in translations.
Method: A two-stage pipeline approach for evaluating literary machine translation from English to Korean, using interpretable metrics designed specifically for literary content.
Result: The framework achieved higher correlation with human judgment compared to traditional MT metrics but still couldn’t match inter-human agreement, especially for Korean honorifics. LLMs showed bias toward translations from other LLMs.
Conclusion: Current evaluation methods need improvement to ensure accurate and culturally sensitive literary translation, particularly for language-specific features like honorifics, requiring more sophisticated evaluation approaches.
Abstract: In this work, we propose and evaluate the feasibility of a two-stage pipeline to evaluate literary machine translation, in a fine-grained manner, from English to Korean. The results show that our framework provides fine-grained, interpretable metrics suited for literary translation and obtains a higher correlation with human judgment than traditional machine translation metrics. Nonetheless, it still fails to match inter-human agreement, especially in metrics like Korean Honorifics. We also observe that LLMs tend to favor translations generated by other LLMs, and we highlight the necessity of developing more sophisticated evaluation methods to ensure accurate and culturally sensitive machine translation of literary works.
[60] Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
Main category: cs.CL
TL;DR: Tokenization significantly impacts LLM cognition by affecting semantic building blocks, distributional pattern access, and introducing bias, despite being overlooked in current architectures.
Details
Motivation: To examine how tokenization affects LLM performance and cognition, particularly its role in creating semantic primitives and conveying distributional patterns, while identifying potential biases introduced through tokenization.
Method: Analysis of BPE tokenizer outputs, existing model vocabularies from Hugging Face and tiktoken, and examination of token vectors through RoBERTa (large) model layers (see the tokenization demo after the abstract).
Result: Tokenization creates sub-optimal semantic building blocks, obscures access to distributional patterns, acts as a backdoor for bias, and the tokenization algorithm’s objective function impacts LLM cognition.
Conclusion: Current linguistically-agnostic tokenization techniques need improvement to better handle semantic units and distributional patterns, while addressing bias issues that alignment practices may not remediate.
Abstract: Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model’s cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens and current structural constraints motivate changes to existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating sub-optimal semantic building blocks and obscuring the model’s access to the necessary distributional patterns, we describe how tokens and pretraining can act as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm’s objective function impacts the LLM’s cognition, despite being arguably meaningfully insulated from the main system intelligence. [First uploaded to arXiv in December, 2024.]
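The tokenization concerns are easy to see directly. A small demo using tiktoken's cl100k_base encoding to show how words decompose into sub-word tokens that need not align with morphemes; the example words are arbitrary.

```python
# Small demo of how BPE splits words into tokens that need not align with
# morphemes or other human-meaningful units. Example words are arbitrary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["unhappiness", "tokenization", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {pieces}")
# A morpheme-aware split of 'unhappiness' would be ['un', 'happi', 'ness'];
# the BPE split is driven by corpus frequency, not linguistic structure.
```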
[61] FinMTEB: Finance Massive Text Embedding Benchmark
Yixuan Tang, Yi Yang
Main category: cs.CL
TL;DR: FinMTEB is a financial domain benchmark with 64 datasets across 7 tasks. Domain-adapted models outperform general ones, and surprisingly BoW beats dense embeddings in financial STS tasks.
Details
Motivation: Real-world financial applications require domain-specific evaluation of embedding models, as general benchmarks show limited correlation with financial tasks.
Method: Created FinMTEB benchmark with 64 financial datasets across 7 tasks in Chinese/English. Developed Fin-E5 model using persona-based data synthesis for financial embedding training (see the BoW baseline sketch after the abstract).
Result: Domain-adapted models consistently outperform general-purpose models. Surprisingly, simple Bag-of-Words approach outperforms dense embeddings in financial Semantic Textual Similarity tasks.
Conclusion: Establishes robust evaluation framework for financial NLP applications and provides insights for developing domain-specific embedding models, highlighting limitations of current dense embedding techniques in financial contexts.
Abstract: Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advances in large language models (LLMs) have further enhanced the performance of embedding models. While these models are often benchmarked on general-purpose datasets, real-world applications demand domain-specific evaluation. In this work, we introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a specialized counterpart to MTEB designed for the financial domain. FinMTEB comprises 64 financial domain-specific embedding datasets across 7 tasks that cover diverse textual types in both Chinese and English, such as financial news articles, corporate annual reports, ESG reports, regulatory filings, and earnings call transcripts. We also develop a finance-adapted model, Fin-E5, using a persona-based data synthetic method to cover diverse financial embedding tasks for training. Through extensive evaluation of 15 embedding models, including Fin-E5, we show three key findings: (1) performance on general-purpose benchmarks shows limited correlation with financial domain tasks; (2) domain-adapted models consistently outperform their general-purpose counterparts; and (3) surprisingly, a simple Bag-of-Words (BoW) approach outperforms sophisticated dense embeddings in financial Semantic Textual Similarity (STS) tasks, underscoring current limitations in dense embedding techniques. Our work establishes a robust evaluation framework for financial NLP applications and provides crucial insights for developing domain-specific embedding models.
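The surprising BoW finding corresponds to a genuinely simple baseline. A minimal sketch of Bag-of-Words cosine similarity for STS with scikit-learn; the sentence pairs are illustrative, not FinMTEB data.

```python
# Minimal Bag-of-Words STS baseline: cosine similarity over raw count
# vectors, the kind of simple method the benchmark found competitive on
# financial STS. Sentence pairs are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    ("Q3 revenue rose 12% year over year.",
     "Revenue increased 12% compared with the prior-year quarter."),
    ("The firm cut its dividend.",
     "Earnings call scheduled for Thursday."),
]
vectorizer = CountVectorizer().fit([s for pair in pairs for s in pair])
for a, b in pairs:
    X = vectorizer.transform([a, b])
    print(f"{cosine_similarity(X[0], X[1])[0, 0]:.2f}  {a!r} | {b!r}")
```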
[62] Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Kocmi Tom
Main category: cs.CL
TL;DR: The paper proposes adapting best practices from machine translation evaluation to improve multilingual large language model (mLLM) evaluation, addressing current shortcomings in comprehensiveness, rigor, and standardization.
Details
Motivation: Current evaluation practices for multilingual large language models lack comprehensiveness, scientific rigor, and consistent adoption, which undermines their ability to meaningfully guide mLLM development.
Method: The authors draw parallels with machine translation evaluation practices, conduct targeted experiments across key stages of the generative evaluation pipeline, and identify essential components for robust meta-evaluation of mLLMs.
Result: The research demonstrates how best practices from MT evaluation can deepen understanding of quality differences between models and provides actionable recommendations.
Conclusion: The paper distills insights into a checklist of actionable recommendations for mLLM research and development, promoting more transparent reporting standards and reliable evaluations.
Abstract: Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.
[63] Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification
Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
Main category: cs.CL
TL;DR: Training-free speculative decoding method using alignment sampling and flexible verification to accelerate LLM generation while avoiding the training costs of existing approaches
Details
Motivation: Existing speculative decoding methods require expensive training to achieve draft-target alignment, which limits practical deployment.
Method: Proposes alignment sampling, which uses the prefilling-phase distribution to produce better draft candidates, and flexible verification with an adaptive probability threshold to handle non-aligned but high-quality drafts (see the verification sketch after the abstract).
Result: Achieves 3.3 point average generation score improvement for LLaMA3, mean acceptance length of 2.39, and 2.23x speedup across 8 datasets
Conclusion: Training-free approach successfully improves both generation quality and inference efficiency without the training overhead of existing methods
Abstract: Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based methods, e.g., EAGLE, Medusa, involving considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages the output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further improving inference efficiency. Experiments on 8 datasets (including question answering, summarization and code completion tasks) show that our approach increases the average generation score by 3.3 points for the LLaMA3 model. Our method achieves a mean acceptance length of up to 2.39 and speeds up generation by a factor of 2.23.
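Standard speculative verification accepts a draft token with probability min(1, p_target / p_draft). A minimal sketch of a flexible variant that additionally accepts drafts to which the target model itself assigns high probability; the running-mean threshold adaptation below is an assumed stand-in for the paper's adaptive rule.

```python
# Minimal sketch of flexible draft verification: accept either by the
# standard speculative ratio test or when the target model itself assigns
# the draft token high probability. The EMA threshold update is an assumed
# stand-in for the paper's adaptive rule.
import random

def verify_tokens(drafts, ema_decay=0.9, init_threshold=0.3, seed=0):
    """drafts: list of (p_target, p_draft) for each proposed token."""
    rng = random.Random(seed)
    threshold, accepted = init_threshold, []
    for p_target, p_draft in drafts:
        ratio_ok = rng.random() < min(1.0, p_target / p_draft)
        quality_ok = p_target >= threshold     # non-aligned but high quality
        accepted.append(ratio_ok or quality_ok)
        # Adapt the threshold toward recently seen target probabilities.
        threshold = ema_decay * threshold + (1 - ema_decay) * p_target
        if not accepted[-1]:
            break                              # reject: resample and stop
    return accepted

print(verify_tokens([(0.40, 0.55), (0.25, 0.70), (0.05, 0.60)]))
```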
[64] Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Main category: cs.CL
TL;DR: Multilingual CLIP models exhibit stronger gender bias than English-only versions, with bias patterns varying by language resource level and gender marking systems. Cross-lingual weight sharing transfers English stereotypes to gender-neutral languages.
Details
Motivation: To systematically audit social biases in multilingual vision-language models, as their biases remain underexplored despite promises of universal image-text retrieval.
Method: Used balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting to quantify race and gender bias and measure stereotype amplification across four multilingual CLIP variants covering ten languages with varying resource availability and morphological gender marking (see the probe sketch after the abstract).
Result: Every model showed stronger gender bias than English-only baseline. Low-resource languages exhibited largest biases, shared encoders transferred English stereotypes to gender-neutral languages, and highly gendered languages consistently magnified all bias types. SigLIP-2 reduced some biases but amplified crime associations in caption-sparse contexts.
Conclusion: Aggregated metrics mask language-specific bias hotspots, highlighting the need for fine-grained, language-aware bias evaluation in multilingual VLM research to address cross-lingual stereotype transfer and amplification.
Abstract: Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits – and in caption-sparse contexts (e.g., Xhosa) amplifies – the English anchor’s crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.
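The zero-shot audit boils down to comparing how strongly a stereotype caption matches images from balanced demographic groups. A minimal association-probe sketch with the public openai/clip-vit-base-patch32 checkpoint; blank placeholder images stand in for FairFace photos, and the caption and grouping are illustrative.

```python
# Minimal sketch of a zero-shot association probe: compare how strongly a
# stereotype caption matches images from two balanced demographic groups.
# Blank placeholder images stand in for FairFace photos.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

group_a = [Image.new("RGB", (224, 224), "gray")] * 2   # placeholder images
group_b = [Image.new("RGB", (224, 224), "white")] * 2
caption = "a photo of a criminal"                      # probed stereotype

def mean_similarity(images):
    batch = processor(text=[caption], images=images,
                      return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**batch)
    return out.logits_per_image.mean().item()

gap = mean_similarity(group_a) - mean_similarity(group_b)
print(f"association gap: {gap:.3f}")  # nonzero gap on balanced data = bias
```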
[65] Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts
Georgios Chochlakis, Peter Wu, Arjun Bedi, Marcus Ma, Kristina Lerman, Shrikanth Narayanan
Main category: cs.CL
TL;DR: LLM-based framework for subjective label verification and correction that distinguishes legitimate semantic variation from annotation errors by analyzing when models fail to copy reference labels in context.
Details
Motivation: Addressing annotation variability in subjective NLP tasks where human disagreement reflects legitimate semantic interpretation differences rather than noise, requiring methods to distinguish subjectivity from error.
Method: Proposes the Label-in-a-Haystack setting, where LLMs are shown query+label pairs and prompted to predict the labels again; failure to copy the labels indicates task-relevant information. Introduces the LiaHR framework, which assigns the generated labels when model outputs diverge from the reference labels (see the prompt sketch after the abstract).
Result: Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction, showing it can enhance signal-to-noise ratios in annotation pipelines.
Conclusion: The proposed framework effectively distinguishes between legitimate subjectivity and annotation errors in subjective NLP tasks, providing a practical approach for improving label quality through LLM-based verification and correction.
Abstract: Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show how the failure to copy the label(s) to the output of the LLM are task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at https://github.com/gchochla/liahr.
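The Label-in-a-Haystack check places the query with its own label among the demonstrations and asks for the label again. A minimal prompt-construction sketch, where `llm` is a hypothetical callable and the emotion task and examples are illustrative.

```python
# Minimal sketch of the Label-in-a-Haystack check: the query appears with
# its own label among the demonstrations, and failure to copy that label
# back marks it as suspect. `llm` is a hypothetical callable (prompt -> str).
from typing import Callable, List, Tuple

def liah_check(query: str, gold: str, demos: List[Tuple[str, str]],
               llm: Callable[[str], str]) -> Tuple[bool, str]:
    lines = ["Label each text with its emotion."]   # task-specific instruction
    for text, label in demos + [(query, gold)]:     # query+gold in the haystack
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(f"Text: {query}\nLabel:")          # ask for the query again
    predicted = llm("\n\n".join(lines)).strip()
    return predicted == gold, predicted             # mismatch -> suspect label

ok, pred = liah_check("I can't stop smiling today!", "anger",
                      demos=[("We lost everything.", "sadness")],
                      llm=lambda p: "joy")
print(ok, pred)   # False 'joy' -> candidate for LiaHR relabeling
```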
[66] NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
Main category: cs.CL
TL;DR: This paper proposes a methodology to create culturally-aligned pre-training data for low-resource languages, developing NileChat - a 3B parameter LLM for Egyptian and Moroccan communities that outperforms similar-sized models and matches larger ones.
Details
Motivation: Current LLM research relies on translating English corpora, which results in models aligned with source language culture rather than representing local cultural heritage and values of underrepresented communities.
Method: Proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to specific communities, considering language, cultural heritage, and cultural values. Developed NileChat for Egyptian and Moroccan dialects as testbeds.
Result: NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models on understanding, translation, and cultural/values alignment benchmarks.
Conclusion: The approach successfully enhances linguistic capabilities while preserving cultural identity, and the methods/data/models are shared to promote inclusion of diverse communities in LLM development.
Abstract: Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.
[67] Faster and Better LLMs via Latency-Aware Test-Time Scaling
Zili Wang, Tianyu Zhang, Haoli Bai, Lu Hou, Xianzhi Yu, Wulong Liu, Shiming Xiang, Lei Zhu
Main category: cs.CL
TL;DR: Test-Time Scaling (TTS) methods are not latency-optimal despite being compute-optimal. The paper proposes branch-wise and sequence-wise parallelism to achieve latency-optimal TTS, enabling large models to achieve high accuracy within strict time constraints.
Details
Motivation: Existing TTS research has overlooked efficiency from a latency-sensitive perspective. Compute-optimal TTS doesn't always result in the lowest latency in critical scenarios where response time is crucial.
Method: Proposes two key approaches: (1) branch-wise parallelism using multiple concurrent inference branches, and (2) sequence-wise parallelism enabled by speculative decoding. Integrates both approaches with proper computational resource allocation.
Result: Achieved 82.3% accuracy on MATH-500 within 1 minute using a 32B model, and 72.4% accuracy within 10 seconds using a smaller 3B model.
Conclusion: Latency-aware TTS is crucial for delivering both speed and accuracy in latency-sensitive scenarios. The proposed approaches effectively optimize TTS for real-time performance constraints.
Abstract: Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
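As a rough illustration of branch-wise parallelism, the sketch below launches several sampling branches concurrently and majority-votes whatever finishes within the latency budget; `generate_once` is a placeholder for any sampling-based LLM call, and the voting rule is an assumption rather than the paper's method.

```python
# Illustrative branch-wise parallelism for latency-constrained TTS:
# run N sampling branches concurrently, keep whatever completes within
# the budget, and majority-vote the final answer.
import collections
import concurrent.futures

def generate_once(prompt: str) -> str:
    raise NotImplementedError("one sampled completion per call")

def branch_parallel_answer(prompt: str, n_branches: int = 8,
                           budget_s: float = 60.0) -> str:
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=n_branches)
    futures = [pool.submit(generate_once, prompt) for _ in range(n_branches)]
    answers = []
    try:
        for f in concurrent.futures.as_completed(futures, timeout=budget_s):
            answers.append(f.result())
    except concurrent.futures.TimeoutError:
        pass  # budget exhausted; keep whichever branches finished
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
    if not answers:
        raise RuntimeError("no branch finished within the latency budget")
    return collections.Counter(answers).most_common(1)[0][0]
```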
[68] MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs
Ke Wang, Yiming Qin, Nikolaos Dimitriadis, Alessandro Favero, Pascal Frossard
Main category: cs.CL
TL;DR: MEMOIR is a scalable framework for lifelong model editing that uses residual memory with sparse activation patterns to efficiently update language models without forgetting previous knowledge or compromising generalization.
Details
Motivation: Real-world language models need frequent updates with new knowledge, but current editing methods struggle with scalability, interference between edits, and maintaining generalization capabilities without retraining.
Method: Uses a residual memory module with sample-dependent sparsification masks to isolate each edit to distinct memory parameters. At inference, compares sparse activation patterns to identify relevant edits for new queries.
Result: Achieves state-of-the-art performance on QA, hallucination correction, and OOD generalization benchmarks for LLaMA-3 and Mistral models, scaling to thousands of sequential edits with minimal forgetting.
Conclusion: MEMOIR provides an effective solution for scalable and reliable lifelong model editing that preserves model capabilities while enabling efficient knowledge updates through sparse memory activation patterns.
Abstract: Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably, without retraining or forgetting previous information, remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks for LLaMA-3 and Mistral backbones demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
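A speculative sketch of the sparse residual-memory idea as the abstract describes it: a sample-dependent top-k mask confines each edit, and inference activates the memory only when a query's pattern overlaps a stored one. Dimensions, the overlap threshold, and the module interface are all illustrative assumptions.

```python
# Speculative sketch of a MEMOIR-style sparse residual memory; sizes
# and the overlap threshold are illustrative assumptions.
import torch

class ResidualMemory(torch.nn.Module):
    def __init__(self, d_model: int = 512, k: int = 32):
        super().__init__()
        self.W_mem = torch.nn.Parameter(torch.zeros(d_model, d_model))
        self.k = k
        self.stored_masks: list[torch.Tensor] = []  # one pattern per edit

    def sparse_mask(self, h: torch.Tensor) -> torch.Tensor:
        # Sample-dependent mask: keep only the k strongest activations.
        mask = torch.zeros_like(h)
        mask[h.abs().topk(self.k).indices] = 1.0
        return mask

    def store_edit(self, h: torch.Tensor) -> None:
        # During editing, the fact would be trained into W_mem restricted
        # to this mask; here we only record the pattern for retrieval.
        self.stored_masks.append(self.sparse_mask(h))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (d_model,) hidden state for one query."""
        mask = self.sparse_mask(h)
        # Activate memory only if the query's pattern overlaps a stored
        # edit; unrelated prompts pass through the model unchanged.
        if not any((mask * m).sum() >= self.k // 2 for m in self.stored_masks):
            return h
        return h + (mask * h) @ self.W_mem  # residual update on masked dims
```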
[69] Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes
Tyler Loakman, William Thorne, Chenghua Lin
Main category: cs.CL
TL;DR: LLMs struggle to explain complex humor types beyond simple puns, with none reliably explaining all joke forms despite reasoning capabilities.
Details
Motivation: Existing computational humor research focuses mainly on short pun-based jokes, but real-world humor involves more complex forms requiring esoteric knowledge. The study investigates whether LLMs can explain diverse humor types beyond simple puns.
Method: Curated a dataset of 600 jokes across 4 types (heterographic/homographic puns, internet humor, topical jokes) with manual explanations. Tested zero-shot abilities of various LLMs to accurately explain different joke types.
Result: None of the tested models (including reasoning models) could reliably generate adequate explanations for all joke types. Models performed poorly on complex humor requiring real-world knowledge.
Conclusion: Current LLMs have limited humor explanation capabilities, highlighting the narrow focus of existing research on overly simple joke forms and the need for better models that can handle complex, knowledge-dependent humor.
Abstract: Humour, as a complex language form, is derived from myriad aspects of life. Whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular form. We compare models’ joke explanation abilities from simple puns to complex topical humour that requires esoteric knowledge of real-world entities and events. To this end, we curate a dataset of 600 jokes across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (including reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most existing works on overly simple joke forms.
[70] Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
Yi Feng, Jiaqi Wang, Wenxuan Zhang, Zhuang Chen, Yutong Shen, Xiyao Xiao, Minlie Huang, Liping Jing, Jian Yu
Main category: cs.CL
TL;DR: A framework combining Interactive Narrative Therapist (INT) and Innovative Moment Assessment (IMA) that outperforms standard LLMs in therapeutic quality by simulating expert narrative therapy and tracking therapeutic progress through narrative shifts.
Details
Motivation: Current LLM-based mental health approaches lack realism in psychotherapy simulation and fail to capture therapeutic progression over time. Narrative therapy is underutilized due to limited access and social stigma.
Method: Two-component framework: INT simulates expert narrative therapists through therapeutic stage planning and contextually appropriate responses. IMA evaluates effectiveness by tracking “Innovative Moments” (narrative shifts) in client speech.
Result: INT consistently outperforms standard LLMs in therapeutic quality and depth across 260 simulated clients and 230 human participants. Also effective in synthesizing high-quality support conversations.
Conclusion: The proposed framework successfully addresses limitations of current LLM approaches by providing realistic narrative therapy simulation and objective progress tracking, enabling better mental health support applications.
Abstract: Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses. Second, IMA (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking “Innovative Moments” (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that INT consistently outperforms standard LLMs in therapeutic quality and depth. We further demonstrate the effectiveness of INT in synthesizing high-quality support conversations to facilitate social applications.
[71] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen
Main category: cs.CL
TL;DR: Automated pipeline for creating RL training environments and verifiable reward mechanisms to enhance LLM tool-use performance without degrading general capabilities.
Details
Motivation: Progress in LLM tool use is limited by lack of efficient RL frameworks due to challenges in stable training environment construction and verifiable reward design.
Method: Proposed automated environment construction pipeline with scenario decomposition, document generation, function integration, complexity scaling, and localized deployment, plus verifiable reward mechanism evaluating tool precision and task completeness.
Result: Significantly enhanced tool-use performance across various LLM scales without degrading general capabilities, regardless of inference modes or training algorithms.
Conclusion: The approach improves context understanding and reasoning through updates to lower-layer MLP parameters, providing an effective framework for RL-based tool-use training.
Abstract: Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models.
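The verifiable reward could plausibly be sketched as a weighted mix of tool-call precision and task completeness, as below; the trajectory schema, field names, and the 50/50 weighting are assumptions, not the paper's definition.

```python
# Hedged sketch of a verifiable tool-use reward: score the precision of
# individual tool calls against a reference, plus the completeness of
# the artifacts the task was supposed to produce.

def tool_use_reward(trajectory: list[dict], expected_calls: list[dict],
                    required_outputs: set[str]) -> float:
    """trajectory: tool calls the model actually made;
    expected_calls: verifiable reference calls for this scenario;
    required_outputs: artifacts the finished task must produce."""
    made = {(c["tool"], tuple(sorted(c["args"].items()))) for c in trajectory}
    expected = {(c["tool"], tuple(sorted(c["args"].items()))) for c in expected_calls}
    precision = len(made & expected) / max(len(made), 1)
    produced = {o for c in trajectory for o in c.get("outputs", [])}
    completeness = len(required_outputs & produced) / max(len(required_outputs), 1)
    return 0.5 * precision + 0.5 * completeness  # weighting is illustrative
```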
[72] Decoding Neural Emotion Patterns through Large Language Model Embeddings
Gideon Vos, Maryam Ebrahimpour, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
Main category: cs.CL
TL;DR: A computational framework that maps textual emotional content to brain regions without neuroimaging, using text embeddings and clustering to analyze emotional patterns in different populations and compare human vs AI emotional expression.
Details
Motivation: Traditional neuroimaging is expensive and lab-bound, while abundant digital text offers new opportunities for emotion-brain mapping. Previous work has examined neuroimaging-based emotion localization and computational text analysis separately with little integration.
Method: Using OpenAI’s text-embedding-ada-002 to generate semantic representations, applying dimensionality reduction and clustering to identify emotional groups, and mapping them to 18 brain regions associated with emotional processing. Three experiments: analyzing conversational data from healthy vs depressed subjects, applying to GoEmotions dataset, and comparing human-written text with LLM responses.
Result: Neuroanatomically plausible mappings with high spatial specificity. Depressed subjects showed greater limbic engagement for negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions.
Conclusion: This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes clinical populations, and provides a brain-based benchmark for evaluating AI emotional expression.
Abstract: Understanding how emotional expression in language relates to brain function is a challenge in computational neuroscience and affective computing. Traditional neuroimaging is costly and lab-bound, but abundant digital text offers new avenues for emotion-brain mapping. Prior work has largely examined neuroimaging-based emotion localization or computational text analysis separately, with little integration. We propose a computational framework that maps textual emotional content to anatomically defined brain regions without requiring neuroimaging. Using OpenAI’s text-embedding-ada-002, we generate high-dimensional semantic representations, apply dimensionality reduction and clustering to identify emotional groups, and map them to 18 brain regions linked to emotional processing. Three experiments were conducted: i) analyzing conversational data from healthy vs. depressed subjects (DAIC-WOZ dataset) to compare mapping patterns, ii) applying the method to the GoEmotions dataset and iii) comparing human-written text with large language model (LLM) responses to assess differences in inferred brain activation. Emotional intensity was scored via lexical analysis. Results showed neuroanatomically plausible mappings with high spatial specificity. Depressed subjects exhibited greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex). This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression.
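The embedding-and-clustering stage of this pipeline is straightforward to reproduce in outline; the snippet below uses the OpenAI v1 Python client with text-embedding-ada-002 as named in the abstract, while the cluster-to-brain-region mapping, which is the paper's actual contribution, is only stubbed in a comment.

```python
# Outline of the embed -> reduce -> cluster stage described above.
from openai import OpenAI
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [d.embedding for d in resp.data]

texts = ["I feel hopeless today.", "That made me so happy!", "I'm furious."]
X = PCA(n_components=2).fit_transform(embed(texts))       # reduce 1536-d vectors
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # emotional groups
# Each cluster would then be mapped to one of the 18 emotion-related
# brain regions via the paper's lexical scoring (not reproduced here).
print(dict(zip(texts, labels)))
```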
[73] Can Large Language Models Master Complex Card Games?
Wei Wang, Felix Henry, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
Main category: cs.CL
TL;DR: LLMs can master complex card games through fine-tuning on gameplay data, achieving performance comparable to strong game AIs while showing both positive transfer for similar games and conflicts for dissimilar ones, though with some decline in general capabilities that can be mitigated.
Details
Motivation: To explore whether large language models (LLMs) can achieve similar success in complex games as specialized AI systems like AlphaGo and AlphaZero, particularly in the domain of card games.
Method: Systematically assessed LLMs across eight diverse card games through fine-tuning on high-quality gameplay data, evaluating performance impact and general capability retention.
Result: LLMs approached strong game AI performance through supervised fine-tuning, mastered multiple complex card games simultaneously with performance augmentation for similar games and conflicts for dissimilar ones, and experienced general capability decline that could be mitigated with general instruction data.
Conclusion: LLMs demonstrate strong learning ability and versatility in mastering complex card games, showing potential for game AI applications while revealing trade-offs between specialized performance and general capabilities.
Abstract: Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models’ ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can master multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs.
[74] MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
Main category: cs.CL
TL;DR: MachineLearningLM is a framework that enhances LLMs’ in-context learning for ML tasks through continued pretraining on synthetic data from structural causal models, enabling strong many-shot scaling while preserving general capabilities.
Details
Motivation: Large language models struggle to learn from many in-context examples on standard machine learning tasks without gradient descent, limiting their practical application in ML workflows.
Method: Continued pretraining framework using synthetic ML tasks from millions of structural causal models, with random-forest teacher distillation and token-efficient prompting for batch inference.
Result: Outperforms strong LLM baselines by ~15% on out-of-distribution tabular classification across multiple domains, shows monotonic accuracy improvement from 8 to 1,024 shots, and achieves 75.4% on MMLU while preserving general capabilities.
Conclusion: MachineLearningLM successfully equips general-purpose LLMs with robust in-context ML capability while maintaining their broad knowledge and reasoning abilities, demonstrating effective many-shot scaling for practical applications.
Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
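A guess at what token-efficient serialization of many-shot tabular tasks might look like: one compact delimited row per demonstration instead of verbose natural-language templates. The exact format MachineLearningLM uses is not given in this summary, so the layout below is purely illustrative.

```python
# Illustrative compact serialization of a many-shot tabular task into
# a single prompt; the delimiter scheme is an assumption.

def serialize_task(header: list[str], shots: list[tuple[list, str]],
                   query: list) -> str:
    lines = ["|".join(header) + "|label"]
    for features, label in shots:                    # many-shot demonstrations
        lines.append("|".join(map(str, features)) + f"|{label}")
    lines.append("|".join(map(str, query)) + "|?")   # row to classify
    return "\n".join(lines)

prompt = serialize_task(
    header=["age", "bmi", "smoker"],
    shots=[([63, 31.2, 1], "high_risk"), ([24, 22.0, 0], "low_risk")],
    query=[51, 28.4, 1],
)
print(prompt)
```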
[75] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Main category: cs.CL
TL;DR: Parallel-R1 is a reinforcement learning framework that trains LLMs for parallel thinking (exploring multiple reasoning paths concurrently) using a progressive curriculum, achieving significant accuracy improvements on math benchmarks.
Details
Motivation: Existing methods rely on supervised fine-tuning which encourages imitation rather than exploration. There's a need for RL-based training to enable genuine parallel thinking capabilities for complex reasoning tasks.
Method: Progressive curriculum: first use SFT on easier tasks to instill parallel thinking, then transition to RL to explore and generalize on harder problems. Addresses cold-start problem in RL training.
Result: 8.4% accuracy improvement over sequential thinking models on challenging tasks. 42.9% improvement over baseline on AIME25. Model shifts from using parallel thinking for exploration to multi-perspective verification.
Conclusion: Parallel thinking serves as a mid-training exploration scaffold that unlocks higher performance ceilings after RL training, demonstrating successful instillation of parallel thinking capabilities in LLMs.
Abstract: Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model’s thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.
cs.CV
[76] Australian Supermarket Object Set (ASOS): A Benchmark Dataset of Physical Objects and 3D Models for Robotics and Computer Vision
Akansel Cosgun, Lachlan Chumbley, Benjamin J. Meyer
Main category: cs.CV
TL;DR: ASOS is a new 3D dataset of 50 common Australian supermarket items with high-quality textured meshes for robotics and computer vision benchmarking.
Details
Motivation: Existing datasets use synthetic models or specialized objects with limited accessibility, so there's a need for cost-effective, real-world objects that are readily available.
Method: Used structure-from-motion techniques with high-resolution imaging to generate watertight 3D meshes of 50 supermarket items across 10 categories.
Result: Created a comprehensive dataset with diverse shapes, sizes, and weights that can be easily sourced from a major Australian supermarket chain.
Conclusion: ASOS provides valuable benchmarking capabilities for object detection, pose estimation, and robotics applications due to its emphasis on accessibility and real-world applicability.
Abstract: This paper introduces the Australian Supermarket Object Set (ASOS), a comprehensive dataset comprising 50 readily available supermarket items with high-quality 3D textured meshes designed for benchmarking in robotics and computer vision applications. Unlike existing datasets that rely on synthetic models or specialized objects with limited accessibility, ASOS provides a cost-effective collection of common household items that can be sourced from a major Australian supermarket chain. The dataset spans 10 distinct categories with diverse shapes, sizes, and weights. 3D meshes are acquired using structure-from-motion techniques with high-resolution imaging to generate watertight meshes. The dataset’s emphasis on accessibility and real-world applicability makes it valuable for benchmarking object detection, pose estimation, and robotics applications.
[77] A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval
Jiayi Miao, Dingxin Lu, Zhuqi Wang
Main category: cs.CV
TL;DR: A multimodal RAG framework for post-disaster housing damage assessment that combines visual building damage analysis with text retrieval from insurance policies, achieving 9.6% improvement in retrieval accuracy.
Details
Motivation: Accurate housing damage evaluations after natural disasters are crucial for insurance claims and resource planning, requiring effective integration of visual damage assessment with policy information retrieval.
Method: Two-branch multimodal encoder with ResNet+Transformer visual branch for building damage analysis and BERT text branch for policy retrieval. Includes cross-modal interaction via multi-head attention and modal attention gating for dynamic evidence control. End-to-end training with multi-task loss optimization.
Result: Superior performance in retrieval accuracy and damage severity classification, with Top-1 retrieval accuracy improved by 9.6%.
Conclusion: The MM-RAG framework effectively bridges visual damage assessment and policy information retrieval, providing a comprehensive solution for post-disaster insurance response through collaborative multimodal learning.
Abstract: After natural disasters, accurate evaluation of housing damage is important for insurance claims response and resource planning. In this work, we introduce a novel multimodal retrieval-augmented generation (MM-RAG) framework. On top of the classical RAG architecture, we devise a two-branch multimodal encoder in which the image branch employs a visual encoder composed of ResNet and Transformer to extract characteristics of post-disaster building damage, and the text branch harnesses a BERT retriever to vectorize posts and insurance policies and to construct a retrievable restoration index. To impose cross-modal semantic alignment, the model integrates a cross-modal interaction module that bridges the semantic representations of image and text via multi-head attention. Meanwhile, in the generation module, the introduced modal attention gating mechanism dynamically controls the roles of visual evidence and text prior information during generation. The entire framework is trained end-to-end, combining the comparison loss, the retrieval loss, and the generation loss into a multi-task optimization objective, achieving image understanding and policy matching through collaborative learning. The results demonstrate superior performance in retrieval accuracy and damage-severity classification, with Top-1 retrieval accuracy improved by 9.6%.
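The modal attention gating mechanism can be sketched as a learned sigmoid gate that mixes visual evidence and text priors per dimension; the hidden size and single-linear-layer gate below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a modal attention gate: a learned sigmoid gate
# decides, per dimension, how much visual evidence versus text prior
# feeds the generator.
import torch

class ModalGate(torch.nn.Module):
    def __init__(self, d: int = 768):
        super().__init__()
        self.gate = torch.nn.Linear(2 * d, d)

    def forward(self, h_img: torch.Tensor, h_txt: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([h_img, h_txt], dim=-1)))
        return g * h_img + (1.0 - g) * h_txt  # dynamic visual/text mix

fused = ModalGate()(torch.randn(1, 768), torch.randn(1, 768))
```

Such a gate lets the generator lean on the image when visual evidence is strong and fall back on policy text otherwise, which is the behavior the abstract attributes to the gating mechanism.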
[78] Improving MLLM Historical Record Extraction with Test-Time Image Augmentation
Taylor Archibald, Tony Martinez
Main category: cs.CV
TL;DR: A novel ensemble framework using Gemini 2.0 Flash and custom alignment improves LLM-based text extraction from noisy historical documents by 4 percentage points over baseline.
Details
Motivation: To stabilize text extraction from noisy historical documents where single-shot LLM transcription may be unreliable due to document degradation and noise.
Method: Transcribe multiple augmented variants of each document image with Gemini 2.0 Flash, then fuse outputs using a custom Needleman-Wunsch style aligner to produce consensus transcription with confidence scores.
Result: 4 percentage point accuracy improvement over single-shot baseline on a new dataset of 622 Pennsylvania death records; padding and blurring most effective for accuracy, grid warp best for confidence separation.
Conclusion: The approach is simple, scalable, and immediately deployable to other document collections and transcription models for improved historical document processing.
Abstract: We present a novel ensemble framework that stabilizes LLM-based text extraction from noisy historical documents. We transcribe multiple augmented variants of each image with Gemini 2.0 Flash and fuse these outputs with a custom Needleman-Wunsch style aligner that yields both a consensus transcription and a confidence score. We present a new dataset of 622 Pennsylvania death records, and demonstrate our method improves transcription accuracy by 4 percentage points relative to a single-shot baseline. We find that padding and blurring are the most useful for improving accuracy, while grid warp perturbations are best for separating high- and low-confidence cases. The approach is simple, scalable, and immediately deployable to other document collections and transcription models.
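Since the aligner is Needleman-Wunsch style, a toy word-level version conveys the consensus step: align a variant transcript against a reference, fill gaps, and use the agreement rate as a confidence score. The real aligner's scoring and multi-variant fusion are the authors' own and are simplified here.

```python
# Toy word-level Needleman-Wunsch consensus between two transcripts.

def nw_align(a: list[str], b: list[str], gap=-1, match=1, mis=-1):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mis
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap, score[i][j - 1] + gap)
    pairs, i, j = [], n, m  # traceback; None marks a gap
    while i > 0 or j > 0:
        if i and j and score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mis):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]

def consensus(ref: str, variant: str) -> tuple[str, float]:
    pairs = nw_align(ref.split(), variant.split())
    words = [w1 if w1 is not None else w2 for w1, w2 in pairs]
    agreement = sum(w1 == w2 for w1, w2 in pairs) / max(len(pairs), 1)
    return " ".join(words), agreement  # fused transcript + confidence

print(consensus("John Smith died 1899", "John Smith dyed 1899"))
```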
[79] MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance
Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma, Yajun Xu, Wenjing Zhang, Yibing Nan, Kai Wang, Shiguo Lian
Main category: cs.CV
TL;DR: MITS is the first large-scale multimodal benchmark dataset for Intelligent Traffic Surveillance, containing 170,400 real-world images with comprehensive annotations and 5M QA pairs, which significantly improves LMM performance in ITS applications.
Details
Motivation: General-domain LMMs underperform in ITS due to lack of dedicated multimodal datasets, creating a need for domain-specific resources to advance traffic surveillance capabilities.
Method: Created MITS dataset with 170,400 real ITS images annotated with 8 main categories and 24 subcategories, plus generated high-quality captions and 5M instruction-following QA pairs covering five critical ITS tasks.
Result: Fine-tuning on MITS dramatically improved LMM performance: LLaVA-1.5 increased from 0.494 to 0.905 (+83.2%), LLaVA-1.6 from 0.678 to 0.921 (+35.8%), Qwen2-VL from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL from 0.732 to 0.930 (+27.0%).
Conclusion: MITS effectively bridges the gap in ITS multimodal data, enabling development of ITS-specific applications and significantly advancing both ITS and LMM research through open-source release of dataset, code, and models.
Abstract: General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS’s effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5’s performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6’s from 0.678 to 0.921 (+35.8%), Qwen2-VL’s from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL’s from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research.
[80] Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs
Sary Elmansoury, Islam Mesabah, Gerrit Großmann, Peter Neigel, Raj Bhalwankar, Daniel Kondermann, Sebastian J. Vollmer
Main category: cs.CV
TL;DR: Tree-based reasoning for VLMs underperforms standard zero-shot prompting despite achieving high understanding of tree knowledge, though image descriptions can enhance both methods.
Details
Motivation: To investigate whether structured, tree-based reasoning can enhance vision language model performance on fine-grained tasks and large hierarchical label spaces.
Method: A framework that decomposes classification into interpretable decisions using decision trees, evaluated on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets, with exploration of LLM-generated classes and image descriptions for prompt enhancement.
Result: Tree-based reasoning consistently underperformed standard zero-shot prompting despite 98.2% accuracy in tree knowledge understanding. Adding image descriptions improved performance for both tree-based and zero-shot methods.
Conclusion: Structured reasoning has limitations in visual classification, but findings provide insights for designing more interpretable VLM systems.
Abstract: Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluates it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added description enhances the performance of the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.
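Decomposing classification into tree decisions reduces, in sketch form, to walking a tree of yes/no questions answered by the VLM; `ask_vlm` and the tiny traffic-sign tree below are invented for illustration and are not the paper's prompts.

```python
# Sketch of tree-decomposed classification: each internal node asks
# the VLM one interpretable yes/no question.

def ask_vlm(image, question: str) -> bool:
    raise NotImplementedError("return True/False from a VLM prompt")

TREE = {
    "question": "Is the sign round?",
    "yes": {"question": "Does it show a number?",
            "yes": "speed limit", "no": "prohibition"},
    "no": {"question": "Is it triangular?",
           "yes": "warning", "no": "information"},
}

def classify(image, node=TREE) -> str:
    if isinstance(node, str):  # leaf: final class label
        return node
    branch = "yes" if ask_vlm(image, node["question"]) else "no"
    return classify(image, node[branch])
```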
[81] World Modeling with Probabilistic Structure Integration
Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear, Jared Watrous, Simon Kim, Khai Loong Aw, Lilian Naing Chen, Stefan Stojanov, Kevin Feigelis, Imran Thobani, Alex Durango, Khaled Jedoui, Atlas Kazemian, Dan Yamins
Main category: cs.CV
TL;DR: PSI is a system that learns controllable world models from video data through a 3-step cycle: probabilistic prediction, structure extraction, and integration, enabling improved video understanding and generation capabilities.
Details
Motivation: To create richly controllable and flexibly promptable world models that can extract meaningful structures from data and use them for improved prediction and understanding tasks.
Method: Three-step cycle: 1) Build probabilistic graphical model (Psi) as random-access autoregressive sequence model, 2) Extract low-dimensional structures via causal inference, 3) Integrate structures as new token types for continual training.
Result: Trained on 1.4T video tokens; achieved state-of-the-art optical flow, self-supervised depth, object segmentation; enabled various video prediction and understanding tasks with improved performance.
Conclusion: PSI creates an LLM-like universal prompting language for world models, where each cycle augments capabilities and creates new control handles for better data modeling.
Abstract: We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful “intermediate structures”, in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles – akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
[82] Images in Motion?: A First Look into Video Leakage in Collaborative Deep Learning
Md Fazle Rasul, Alanood Alqobaisi, Bruhadeshwar Bezawada, Indrakshi Ray
Main category: cs.CV
TL;DR: First analysis of video data leakage in federated learning via gradient inversion attacks, showing both raw video frames and feature extractor approaches are vulnerable, with image super-resolution enhancing attack effectiveness.
Details
Motivation: Federated learning's privacy protection is threatened by gradient inversion attacks that can reconstruct private training data, but video data vulnerability remains unexplored despite known risks for other data types.
Method: Evaluated two video classification approaches: pre-trained feature extractors and raw video frame processing with simple transformations. Tested gradient inversion attacks across scenarios with zero, one, or multiple reference frames, using image super-resolution to enhance reconstructed frames.
Result: Feature extractors provide greater resilience but still vulnerable if classifier lacks complexity. Image super-resolution successfully enhances reconstructed video quality. Attacks remain viable even without reference frames.
Conclusion: Video data leakage in federated learning is a real threat that warrants further investigation, as current protection methods are insufficient against determined gradient inversion attacks.
Abstract: Federated learning (FL) allows multiple entities to train a shared model collaboratively. Its core, privacy-preserving principle is that participants only exchange model updates, such as gradients, and never their raw, sensitive data. This approach is fundamental for applications in domains where privacy and confidentiality are important. However, the security of this very mechanism is threatened by gradient inversion attacks, which can reverse-engineer private training data directly from the shared gradients, defeating the purpose of FL. While the impact of these attacks is known for image, text, and tabular data, their effect on video data remains an unexamined area of research. This paper presents the first analysis of video data leakage in FL using gradient inversion attacks. We evaluate two common video classification approaches: one employing pre-trained feature extractors and another that processes raw video frames with simple transformations. Our initial results indicate that the use of feature extractors offers greater resilience against gradient inversion attacks. We also demonstrate that image super-resolution techniques can enhance the frames extracted through gradient inversion attacks, enabling attackers to reconstruct higher-quality videos. Our experiments validate this across scenarios where the attacker has access to zero, one, or more reference frames from the target environment. We find that although feature extractors make attacks more challenging, leakage is still possible if the classifier lacks sufficient complexity. We, therefore, conclude that video data leakage in FL is a viable threat, and the conditions under which it occurs warrant further investigation.
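For readers unfamiliar with gradient inversion, the attack's core loop is the standard gradient-matching optimization sketched below on a single frame; the video-specific and super-resolution components of this paper are not reproduced.

```python
# Standard gradient-inversion loop: optimize a dummy input so that its
# gradients match the intercepted ones. Single-frame toy setup.
import torch

def invert_gradients(model, loss_fn, shared_grads, x_shape, y,
                     steps: int = 200, lr: float = 0.1) -> torch.Tensor:
    x = torch.randn(x_shape, requires_grad=True)  # dummy "private" frame
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        grads = torch.autograd.grad(loss_fn(model(x), y), model.parameters(),
                                    create_graph=True)
        # Match dummy gradients to the intercepted ones (L2 distance).
        dist = sum(((g - s) ** 2).sum() for g, s in zip(grads, shared_grads))
        dist.backward()
        opt.step()
    return x.detach()  # approximation of the private input
```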
[83] A Co-Training Semi-Supervised Framework Using Faster R-CNN and YOLO Networks for Object Detection in Densely Packed Retail Images
Hossein Yazdanjouei, Arash Mansouri, Mohammad Shokouhifar
Main category: cs.CV
TL;DR: Semi-supervised co-training framework combining Faster R-CNN and YOLO for object detection in retail environments, with ensemble classification and metaheuristic optimization, reducing annotation costs while maintaining high accuracy.
Details
Motivation: Address challenges in densely packed retail environments where limited labeled data, occlusion, overlapping objects, and frequent product/layout changes make traditional object detection difficult and annotation costly.
Method: Co-training framework with Faster R-CNN (ResNet backbone) for precise localization and YOLO (Darknet backbone) for global context, mutual pseudo-label exchange, ensemble classification (XGBoost, Random Forest, SVM), and metaheuristic hyperparameter optimization.
Result: Strong performance demonstrated on SKU-110k dataset, showing improved accuracy in scenes with occlusion and overlapping objects, with reduced reliance on manual labeling.
Conclusion: The framework is scalable and practical for real-world retail applications including automated inventory tracking, product monitoring, and checkout systems, effectively adapting to retail environment challenges.
Abstract: This study proposes a semi-supervised co-training framework for object detection in densely packed retail environments, where limited labeled data and complex conditions pose major challenges. The framework combines Faster R-CNN (utilizing a ResNet backbone) for precise localization with YOLO (employing a Darknet backbone) for global context, enabling mutual pseudo-label exchange that improves accuracy in scenes with occlusion and overlapping objects. To strengthen classification, it employs an ensemble of XGBoost, Random Forest, and SVM, utilizing diverse feature representations for higher robustness. Hyperparameters are optimized using a metaheuristic-driven algorithm, enhancing precision and efficiency across models. By minimizing reliance on manual labeling, the approach reduces annotation costs and adapts effectively to frequent product and layout changes common in retail. Experiments on the SKU-110k dataset demonstrate strong performance, highlighting the scalability and practicality of the proposed framework for real-world retail applications such as automated inventory tracking, product monitoring, and checkout systems.
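One round of the mutual pseudo-label exchange might look like the following, where each detector's confident predictions supervise the other; the `predict`/`fine_tune` interfaces and the 0.8 confidence threshold are hypothetical, not from the paper.

```python
# Schematic co-training round: confident detections from each model
# become pseudo-labels for the other.

def co_train_round(frcnn, yolo, unlabeled_images, conf_thresh: float = 0.8):
    frcnn_batch, yolo_batch = [], []
    for img in unlabeled_images:
        # Faster R-CNN's confident boxes supervise YOLO, and vice versa.
        boxes_a = [b for b in frcnn.predict(img) if b["score"] >= conf_thresh]
        boxes_b = [b for b in yolo.predict(img) if b["score"] >= conf_thresh]
        if boxes_a:
            yolo_batch.append((img, boxes_a))
        if boxes_b:
            frcnn_batch.append((img, boxes_b))
    yolo.fine_tune(yolo_batch)
    frcnn.fine_tune(frcnn_batch)
```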
[84] Purge-Gate: Backpropagation-Free Test-Time Adaptation for Point Clouds Classification via Token Purging
Moslem Yazdanpanah, Ali Bahri, Mehrdad Noori, Sahar Dastani, Gustavo Adolfo Vargas Hakim, David Osowiechi, Ismail Ben Ayed, Christian Desrosiers
Main category: cs.CV
TL;DR: Token Purging (PG) is a novel backpropagation-free test-time adaptation method for 3D point cloud classification that removes domain-shifted tokens before attention layers, achieving superior accuracy and efficiency.
Details
Motivation: To address performance degradation caused by distribution shifts in 3D point cloud classification without requiring iterative updates or backpropagation during test-time adaptation.
Method: Proposes Token Purging (PG) approach with two variants: PG-SP (using source statistics) and PG-SF (fully source-free using CLS-token-driven adaptation). Removes tokens highly affected by domain shifts before they reach attention layers.
Result: PG-SP achieves +10.3% higher accuracy than state-of-the-art backpropagation-free methods. PG-SF sets new benchmarks for source-free adaptation. PG is 12.4x faster and 5.5x more memory efficient than baseline.
Conclusion: Token Purging provides an effective, efficient backpropagation-free solution for test-time adaptation in 3D point cloud classification, making it suitable for real-world deployment with superior performance and resource efficiency.
Abstract: Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4 times faster and 5.5 times more memory efficient than our baseline, making it suitable for real-world deployment. Code is available at https://github.com/MosyMosy/Purge-Gate
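A rough sketch of how backpropagation-free purging against source statistics (the PG-SP variant) could work: score each token's deviation from source feature statistics and drop the most shifted tokens before attention. The scoring rule and purge ratio below are illustrative assumptions.

```python
# Illustrative token purging against source statistics: z-score each
# token's features, drop the most shifted tokens, no gradients needed.
import torch

def purge_tokens(tokens: torch.Tensor, src_mean: torch.Tensor,
                 src_std: torch.Tensor, purge_ratio: float = 0.2) -> torch.Tensor:
    """tokens: (N, D) point-cloud tokens; src_mean/src_std: (D,) source stats."""
    z = ((tokens - src_mean) / (src_std + 1e-6)).abs().mean(dim=-1)  # shift score
    n_keep = int(tokens.shape[0] * (1.0 - purge_ratio))
    keep = z.topk(n_keep, largest=False).indices  # least-shifted tokens
    return tokens[keep]                           # fed on to attention layers
```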
[85] Fine-Grained Cross-View Localization via Local Feature Matching and Monocular Depth Priors
Zimin Xia, Chenghao Xu, Alexandre Alahi
Main category: cs.CV
TL;DR: A novel cross-view localization method that directly matches ground-level image features with aerial imagery using monocular depth priors, supporting both metric and relative depth for accurate 3-DOF pose estimation without extensive supervision.
Details
Motivation: Previous methods transform ground images to bird's-eye view (BEV) representation, which causes information loss due to perspective distortion and height compression, degrading alignment quality with aerial imagery.
Method: Directly establishes correspondences between ground and aerial images, lifts only matched keypoints to BEV space using monocular depth prior. Uses scale-aware Procrustes alignment for pose estimation and optional scale recovery when using relative depth.
Result: Achieves superior localization performance under challenging conditions (cross-area generalization, unknown orientation), learns accurate local feature correspondences with only weak pose supervision, and works with various relative depth models without per-model finetuning.
Conclusion: The method provides highly interpretable and accurate fine-grained cross-view localization with strong real-world deployment potential due to its flexibility and compatibility with different depth models.
Abstract: We propose an accurate and highly interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image by matching its local features with a reference aerial image. Previous methods typically transform the ground image into a bird’s-eye view (BEV) representation and then align it with the aerial image for localization. However, this transformation often leads to information loss due to perspective distortion or compression of height information, thereby degrading alignment quality with the aerial view. In contrast, our method directly establishes correspondences between ground and aerial images and lifts only the matched keypoints to BEV space using monocular depth prior. Notably, modern depth predictors can provide reliable metric depth when the test samples are similar to the training data. When the depth distribution differs, they still produce consistent relative depth, i.e., depth accurate up to an unknown scale. Our method supports both metric and relative depth. It employs a scale-aware Procrustes alignment to estimate the camera pose from the correspondences and optionally recover the scale when using relative depth. Experimental results demonstrate that, with only weak supervision on camera pose, our method learns accurate local feature correspondences and achieves superior localization performance under challenging conditions, such as cross-area generalization and unknown orientation. Moreover, our method is compatible with various relative depth models without requiring per-model finetuning. This flexibility, combined with strong localization performance, makes it well-suited for real-world deployment.
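The scale-aware Procrustes step named in the abstract is a classic Umeyama-style alignment; a minimal 2D version is sketched below, omitting the pipeline's correspondence weighting and robustness terms.

```python
# Minimal scale-aware Procrustes (Umeyama-style) alignment in 2D.
import numpy as np

def procrustes_2d(src: np.ndarray, dst: np.ndarray, with_scale: bool = True):
    """src, dst: (N, 2) matched keypoints. Finds (s, R, t) with dst ~ s*R@src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d
    U, sig, Vt = np.linalg.svd(D.T @ S)      # SVD of the cross-covariance
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    C = np.diag([1.0, d])
    R = U @ C @ Vt
    s = (sig * np.diag(C)).sum() / (S ** 2).sum() if with_scale else 1.0
    t = mu_d - s * R @ mu_s
    return s, R, t
```

With relative depth, `with_scale=True` absorbs the unknown global scale into s, mirroring the optional scale recovery described above.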
[86] Efficient and Accurate Downfacing Visual Inertial Odometry
Jonas Kühne, Christian Vogt, Michele Magno, Luca Benini
Main category: cs.CV
TL;DR: An efficient Visual Inertial Odometry pipeline optimized for micro/nano-UAVs using quantized feature tracking methods on RISC-V SoCs, achieving 3.65x RMSE reduction.
Details
Motivation: Bridge the gap between high-accuracy VIO pipelines requiring powerful systems and lightweight implementations suitable for microcontrollers on resource-constrained UAV platforms.
Method: Combines state-of-the-art feature detection/tracking (SuperPoint, PX4FLOW, ORB) optimized and quantized for RISC-V SoCs, employs rigid body motion model for error reduction, and implements on ultra-low-power GAP9 SoC for real-world validation.
Result: Achieves average RMSE reduction of up to 3.65x over baseline using ORB tracker. PX4FLOW provides on-par accuracy with ORB at lower runtime for movement speeds below 24 pixels/frame.
Conclusion: The optimized VIO pipeline successfully enables high-accuracy visual odometry on ultra-low-power systems, making it suitable for micro- and nano-UAV applications with computational constraints.
Abstract: Visual Inertial Odometry (VIO) is a widely used computer vision method that determines an agent’s movement through a camera and an IMU sensor. This paper presents an efficient and accurate VIO pipeline optimized for applications on micro- and nano-UAVs. The proposed design incorporates state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB), all optimized and quantized for emerging RISC-V-based ultra-low-power parallel systems on chips (SoCs). Furthermore, by employing a rigid body motion model, the pipeline reduces estimation errors and achieves improved accuracy in planar motion scenarios. The pipeline’s suitability for real-time VIO is assessed on an ultra-low-power SoC in terms of compute requirements and tracking accuracy after quantization. The pipeline, including the three feature tracking methods, was implemented on the SoC for real-world validation. This design bridges the gap between high-accuracy VIO pipelines that are traditionally run on computationally powerful systems and lightweight implementations suitable for microcontrollers. The optimized pipeline on the GAP9 low-power SoC demonstrates an average reduction in RMSE of up to a factor of 3.65x over the baseline pipeline when using the ORB feature tracker. The analysis of the computational complexity of the feature trackers further shows that PX4FLOW achieves on-par tracking accuracy with ORB at a lower runtime for movement speeds below 24 pixels/frame.
[87] Early Detection of Visual Impairments at Home Using a Smartphone Red-Eye Reflex Test
Judith Massmann, Alexander Lichtenstein, Francisco M. López
Main category: cs.CV
TL;DR: KidsVisionCheck is a mobile app that uses AI and smartphone cameras to perform pediatric vision screening via red-eye reflex analysis, achieving 90% accuracy without specialist equipment.
Details
Motivation: To make pediatric vision screening more accessible worldwide by recreating the clinical Bruckner test using mobile devices and AI, enabling early detection of visual impairments in children.
Method: Deep neural networks trained on ophthalmologist-labeled red-eye reflex images from children’s pupils, with data collection optimization for immediate user feedback.
Result: The model achieved 90% accuracy on unseen test data, providing highly reliable vision screening performance without requiring specialist equipment.
Conclusion: This work represents a significant step toward accessible pediatric vision screenings and early intervention for vision abnormalities globally using mobile technology.
Abstract: Numerous visual impairments can be detected in red-eye reflex images from young children. The so-called Bruckner test is traditionally performed by ophthalmologists in clinical settings. Thanks to the recent technological advances in smartphones and artificial intelligence, it is now possible to recreate the Bruckner test using a mobile device. In this paper, we present a first study conducted during the development of KidsVisionCheck, a free application that can perform vision screening with a mobile device using red-eye reflex images. The underlying model relies on deep neural networks trained on children’s pupil images collected and labeled by an ophthalmologist. With an accuracy of 90% on unseen test data, our model provides highly reliable performance without the necessity of specialist equipment. Furthermore, we can identify the optimal conditions for data collection, which can in turn be used to provide immediate feedback to the users. In summary, this work marks a first step toward accessible pediatric vision screenings and early intervention for vision abnormalities worldwide.
[88] DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
Tim Broedermannn, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool
Main category: cs.CV
TL;DR: DGFusion is a novel depth-guided multimodal fusion method for autonomous vehicle perception that uses depth information to dynamically adapt sensor fusion based on spatially varying sensor reliability across scenes.
Details
Motivation: Current sensor fusion approaches treat sensor data uniformly across the spatial extent, which hinders performance in challenging conditions. Depth information can help condition fusion based on sensor reliability.
Method: Proposes DGFusion network that treats multimodal segmentation as a multi-task problem, using lidar measurements both as input and as depth ground truth. Includes auxiliary depth head to learn depth-aware features encoded as local depth tokens that condition attentive cross-modal fusion together with a global condition token.
Result: Achieves state-of-the-art panoptic and semantic segmentation performance on challenging MUSES and DELIVER datasets.
Conclusion: Depth-guided fusion with spatially varying local depth tokens and robust depth loss effectively adapts sensor fusion to varying sensor reliability across scenes, improving performance in adverse conditions.
Abstract: Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model’s inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion
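The depth-token conditioning at the heart of this design can be pictured with a short PyTorch sketch: an auxiliary head predicts per-token depth, the depth values are embedded as local tokens, and those tokens condition cross-modal attention. Layer names, sizes, and the exact conditioning form below are illustrative assumptions, not the released DGFusion code.

```python
import torch
import torch.nn as nn

class DepthGuidedFusion(nn.Module):
    """Toy sketch of depth-token-conditioned cross-modal fusion."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.depth_head = nn.Linear(dim, 1)    # auxiliary depth estimate per token
        self.depth_embed = nn.Linear(1, dim)   # depth value -> local depth token
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, lidar_tokens):
        # rgb_tokens, lidar_tokens: (B, N, C) feature tokens from the two modalities
        depth = self.depth_head(lidar_tokens)     # (B, N, 1); trained against lidar GT
        depth_tokens = self.depth_embed(depth)    # spatially varying condition
        query = rgb_tokens + depth_tokens         # fusion conditioned on local depth
        fused, _ = self.cross_attn(query, lidar_tokens, lidar_tokens)
        return fused, depth                       # depth also feeds a robust depth loss

fused, depth = DepthGuidedFusion()(torch.randn(2, 1024, 256), torch.randn(2, 1024, 256))
```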
[89] Patch-based Automatic Rosacea Detection Using the ResNet Deep Learning Framework
Chengyu Yang, Rishik Reddy Yesgari, Chengjun Liu
Main category: cs.CV
TL;DR: Patch-based rosacea detection using ResNet-18 with localized facial patches achieves competitive accuracy while preserving patient privacy by excluding identifiable features.
Details
Motivation: Rosacea requires early and precise detection for effective treatment. Current full-image methods may compromise patient privacy and lack focus on clinically relevant regions.
Method: Extracted various image patches from facial images in different sizes, shapes, and locations. Used ResNet-18 deep learning framework and conducted investigation studies to evaluate how localized visual information affects model performance.
Result: Patch-based strategies achieved competitive or superior accuracy and sensitivity compared to full-image methods. The approach guides the model to focus on clinically relevant regions, enhances robustness and interpretability, and protects patient privacy.
Conclusion: Patch-based automatic rosacea detection strategies offer practical insights for improving automated dermatological diagnostics by balancing accuracy with privacy preservation.
Abstract: Rosacea, which is a chronic inflammatory skin condition that manifests with facial redness, papules, and visible blood vessels, often requires precise and early detection to significantly improve treatment effectiveness. This paper presents new patch-based automatic rosacea detection strategies using the ResNet-18 deep learning framework. The contributions of the proposed strategies come from the following aspects. First, various image patches are extracted from the facial images of people in different sizes, shapes, and locations. Second, a number of investigation studies are carried out to evaluate how the localized visual information influences the deep learning model performance. Third, thorough experiments are implemented to reveal that several patch-based automatic rosacea detection strategies achieve competitive or superior accuracy and sensitivity compared to the full-image based methods. Finally, the proposed patch-based strategies, which use only localized patches, inherently preserve patient privacy by excluding any identifiable facial features from the data. The experimental results indicate that the proposed patch-based strategies guide the deep learning model to focus on clinically relevant regions, enhance robustness and interpretability, and protect patient privacy. As a result, the proposed strategies offer practical insights for improving automated dermatological diagnostics.
[90] Privacy-Preserving Automated Rosacea Detection Based on Medically Inspired Region of Interest Selection
Chengyu Yang, Rishik Reddy Yesgari, Chengjun Liu
Main category: cs.CV
TL;DR: Privacy-preserving rosacea detection using synthetic data and clinical priors with a redness-informed mask to focus on diagnostically relevant facial areas while excluding identity features.
Details
Motivation: Rosacea is underdiagnosed and automated detection faces challenges due to diffuse symptoms, lack of labeled datasets, and privacy concerns with facial images.
Method: Uses a fixed redness-informed mask to select high red-intensity regions (cheeks, nose, forehead) and trains ResNet-18 on masked synthetic images instead of full-face data.
Result: Achieves superior performance over full-face baselines with notable gains in accuracy, recall, and F1 score on real-world test data.
Conclusion: Synthetic data combined with clinical priors enables accurate and ethical dermatological AI systems for privacy-sensitive applications like telemedicine and large-scale screening.
Abstract: Rosacea is a common but underdiagnosed inflammatory skin condition that primarily affects the central face and presents with subtle redness, pustules, and visible blood vessels. Automated detection remains challenging due to the diffuse nature of symptoms, the scarcity of labeled datasets, and privacy concerns associated with using identifiable facial images. A novel privacy-preserving automated rosacea detection method inspired by clinical priors and trained entirely on synthetic data is presented in this paper. Specifically, the proposed method, which leverages the observation that rosacea manifests predominantly through central facial erythema, first constructs a fixed redness-informed mask by selecting regions with consistently high red channel intensity across facial images. The mask is thus able to focus on diagnostically relevant areas such as the cheeks, nose, and forehead and exclude identity-revealing features. Second, the ResNet-18 deep learning method, which is trained on the masked synthetic images, achieves superior performance over the full-face baselines with notable gains in terms of accuracy, recall and F1 score when evaluated using the real-world test data. The experimental results demonstrate that the synthetic data and clinical priors can jointly enable accurate and ethical dermatological AI systems, especially for privacy-sensitive applications in telemedicine and large-scale screening.
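A fixed redness-informed mask of this kind is easy to prototype: average the red channel across aligned face crops and keep the consistently reddest regions. Below is a minimal NumPy sketch, assuming aligned crops in [0, 1]; the keep fraction is a placeholder, not the paper's setting.

```python
import numpy as np

def build_redness_mask(images, keep_fraction=0.25):
    # images: (N, H, W, 3) aligned face crops in [0, 1]
    mean_red = images[..., 0].mean(axis=0)               # average red channel over the set
    thresh = np.quantile(mean_red, 1.0 - keep_fraction)  # keep the reddest fraction
    return mean_red >= thresh                            # fixed (H, W) boolean mask

def apply_mask(image, mask):
    out = image.copy()
    out[~mask] = 0.0  # zero out identity-revealing regions before training
    return out
```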
[91] Investigating the Impact of Various Loss Functions and Learnable Wiener Filter for Laparoscopic Image Desmoking
Chengyu Yang, Chengjun Liu
Main category: cs.CV
TL;DR: Ablation study of ULW framework for laparoscopic image desmoking, evaluating individual components including learnable Wiener filter and loss function terms.
Details
Motivation: To rigorously assess the effectiveness and necessity of individual components within the ULW framework for laparoscopic image desmoking.
Method: Systematic ablation of components: removal of learnable Wiener filter and selective use of individual loss terms (MSE, SSIM, perceptual loss) from the compound loss function. Benchmarking on paired laparoscopic images dataset.
Result: Evaluation using quantitative metrics (SSIM, PSNR, MSE, CIEDE-2000) and qualitative visual comparisons to assess each component’s contribution.
Conclusion: The study provides comprehensive analysis of which components are essential for optimal performance in the ULW desmoking framework.
Abstract: To rigorously assess the effectiveness and necessity of individual components within the recently proposed ULW framework for laparoscopic image desmoking, this paper presents a comprehensive ablation study. The ULW approach combines a U-Net based backbone with a compound loss function that comprises mean squared error (MSE), structural similarity index (SSIM) loss, and perceptual loss. The framework also incorporates a differentiable, learnable Wiener filter module. In this study, each component is systematically ablated to evaluate its specific contribution to the overall performance of the whole framework. The analysis includes: (1) removal of the learnable Wiener filter, (2) selective use of individual loss terms from the composite loss function. All variants are benchmarked on a publicly available paired laparoscopic images dataset using quantitative metrics (SSIM, PSNR, MSE and CIEDE-2000) alongside qualitative visual comparisons.
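The compound loss being ablated is compact to write down. The sketch below uses placeholder weights (not the ULW paper's values) and an assumed VGG-16 feature cut for the perceptual term; inputs are 3-channel image batches.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16
from torchmetrics.functional import structural_similarity_index_measure as ssim

vgg = vgg16(weights="IMAGENET1K_V1").features[:16].eval()  # frozen feature extractor
for p in vgg.parameters():
    p.requires_grad_(False)

def compound_loss(pred, target, w_mse=1.0, w_ssim=0.5, w_perc=0.1):
    # pred, target: (B, 3, H, W) smoke-free estimate and ground truth
    l_mse = F.mse_loss(pred, target)
    l_ssim = 1.0 - ssim(pred, target)           # SSIM is a similarity; flip to a loss
    l_perc = F.l1_loss(vgg(pred), vgg(target))  # distance in VGG-16 feature space
    return w_mse * l_mse + w_ssim * l_ssim + w_perc * l_perc
```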
[92] WAVE-DETR Multi-Modal Visible and Acoustic Real-Life Drone Detector
Razvan Stefanescu, Ethan Oh, Ruben Vazquez, Chris Mesterharm, Constantin Serban, Ritu Chadha
Main category: cs.CV
TL;DR: WAVE-DETR is a multi-modal drone detector that fuses RGB visual and acoustic signals using Deformable DETR and Wav2Vec2 architectures, achieving significant performance improvements across all drone sizes.
Details
Motivation: To create a robust UAV detection system that works effectively under challenging environmental conditions by combining visual and acoustic information for improved detection accuracy.
Method: Combines Deformable DETR for visual processing with Wav2Vec2 for acoustic feature extraction. Tests four fusion configurations: gated mechanism, linear layer, MLP, and cross attention to integrate acoustic embeddings with multi-resolution visual features.
Result: Gated fusion approach performed best, improving mAP by 11.1-15.3% for small drones across IoU thresholds 0.5-0.9. Overall gains of 3.27-5.84% across all drone sizes (small, medium, large).
Conclusion: Acoustic information significantly enhances drone detection performance, with gated fusion being the most effective method for combining multi-modal features in real-world UAV detection scenarios.
Abstract: We introduce a multi-modal WAVE-DETR drone detector combining visible RGB and acoustic signals for robust real-life UAV object detection. Our approach fuses visual and acoustic features in a unified object detector model relying on the Deformable DETR and Wav2Vec2 architectures, achieving strong performance under challenging environmental conditions. Our work leverages the existing Drone-vs-Bird dataset and the newly generated ARDrone dataset containing more than 7,500 synchronized images and audio segments. We show how the acoustic information is used to improve the performance of the Deformable DETR object detector on the real ARDrone dataset. We developed, trained and tested four different fusion configurations based on a gated mechanism, linear layer, MLP and cross attention. The Wav2Vec2 acoustic embeddings are fused with the multi-resolution feature maps of the Deformable DETR and enhance the object detection performance across all drone sizes. The best performer is the gated fusion approach, which improves the mAP of the Deformable DETR object detector on our in-distribution and out-of-distribution ARDrone datasets by 11.1% to 15.3% for small drones across all IoU thresholds between 0.5 and 0.9. The mAP scores for medium and large drones are also enhanced, with overall gains across all drone sizes ranging from 3.27% to 5.84%.
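Of the four fusion configurations, the gated mechanism is the most straightforward to sketch: project the pooled acoustic embedding, compute a per-token gate, and add the gated audio signal to the visual tokens. The sizes and gating form below are illustrative assumptions, not the WAVE-DETR implementation.

```python
import torch
import torch.nn as nn

class GatedAudioVisualFusion(nn.Module):
    """Sketch of gated fusion between a pooled Wav2Vec2 embedding and visual tokens."""
    def __init__(self, vis_dim=256, aud_dim=768):
        super().__init__()
        self.aud_proj = nn.Linear(aud_dim, vis_dim)
        self.gate = nn.Sequential(nn.Linear(2 * vis_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_tokens, aud_embed):
        # vis_tokens: (B, N, C) multi-resolution tokens; aud_embed: (B, aud_dim)
        a = self.aud_proj(aud_embed).unsqueeze(1).expand_as(vis_tokens)
        g = self.gate(torch.cat([vis_tokens, a], dim=-1))  # per-token gate in [0, 1]
        return vis_tokens + g * a                          # acoustically enhanced features

out = GatedAudioVisualFusion()(torch.randn(2, 300, 256), torch.randn(2, 768))
```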
[93] Surrogate Supervision for Robust and Generalizable Deformable Image Registration
Yihao Liu, Junyu Chen, Lianrui Zuo, Shuwen Wei, Brian D. Boyd, Carmen Andreescu, Olusola Ajilore, Warren D. Taylor, Aaron Carass, Bennett A. Landman
Main category: cs.CV
TL;DR: Surrogate supervision decouples input domain from supervision domain to improve robustness of deep learning-based deformable image registration against input variations like artifacts, FOV mismatch, and modality differences.
Details
Motivation: Deep learning registration methods achieve strong accuracy but remain sensitive to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality differences, limiting their generalizability.
Method: Introduces surrogate supervision which applies estimated spatial transformations to surrogate images, allowing training on heterogeneous inputs while ensuring supervision is computed in domains where similarity is well defined.
Result: Demonstrated strong resilience to input variations including inhomogeneity field, inconsistent field-of-view, and modality differences across three applications, while maintaining high performance on well-curated data.
Conclusion: Surrogate supervision provides a principled framework for training robust and generalizable deep learning-based registration models without increasing complexity, enabling broader applicability in diverse biomedical imaging scenarios.
Abstract: Objective: Deep learning-based deformable image registration has achieved strong accuracy, but remains sensitive to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality difference. We aim to develop a general training paradigm that improves the robustness and generalizability of registration networks. Methods: We introduce surrogate supervision, which decouples the input domain from the supervision domain by applying estimated spatial transformations to surrogate images. This allows training on heterogeneous inputs while ensuring supervision is computed in domains where similarity is well defined. We evaluate the framework through three representative applications: artifact-robust brain MR registration, mask-agnostic lung CT registration, and multi-modal MR registration. Results: Across tasks, surrogate supervision demonstrated strong resilience to input variations including inhomogeneity field, inconsistent field-of-view, and modality differences, while maintaining high performance on well-curated data. Conclusions: Surrogate supervision provides a principled framework for training robust and generalizable deep learning-based registration models without increasing complexity. Significance: Surrogate supervision offers a practical pathway to more robust and generalizable medical image registration, enabling broader applicability in diverse biomedical imaging scenarios.
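The key move, computing the loss in a domain where similarity is well defined, can be sketched in a few lines: the network sees the heterogeneous inputs, but the estimated transform is applied to clean surrogate images and scored there. Function names and the MSE criterion below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Bilinear warp of `image` (B,C,H,W) by a displacement field `flow` (B,2,H,W)."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image)  # (2,H,W) pixel coordinates
    coords = grid.unsqueeze(0) + flow                      # displaced sampling positions
    gx = 2 * coords[:, 0] / (w - 1) - 1                    # normalize to [-1, 1]
    gy = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(image, torch.stack((gx, gy), dim=-1), align_corners=True)

def surrogate_loss(net, moving, fixed, moving_surrogate, fixed_surrogate):
    # net sees the heterogeneous (e.g., artifact-laden) inputs...
    flow = net(moving, fixed)
    # ...but supervision is computed on clean surrogates after the same transform
    return F.mse_loss(warp(moving_surrogate, flow), fixed_surrogate)
```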
[94] An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars
Barkin Buyukcakir, Jannick De Tobel, Patrick Thevissen, Dirk Vandermeulen, Peter Claes
Main category: cs.CV
TL;DR: A framework combining convolutional autoencoder and Vision Transformer improves dental age estimation accuracy and provides diagnostic insights, revealing data-centric limitations in tooth morphology variability.
Details
Motivation: To address the 'black box' nature of deep learning models in high-stakes forensic applications like dental age estimation, and to enhance both performance and transparency.
Method: Proposed framework combining convolutional autoencoder (AE) with Vision Transformer (ViT) for automated staging of mandibular second (tooth 37) and third (tooth 38) molars.
Result: Improved classification accuracy from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Analysis revealed performance gap is data-centric due to high intra-class morphological variability in tooth 38 dataset.
Conclusion: The framework provides both enhanced accuracy and diagnostic insights, demonstrating insufficiency of single interpretability modes and serving as a robust tool for expert decision-making in forensic age estimation.
Abstract: The practical adoption of deep learning in high-stakes forensic applications, such as dental age estimation, is often limited by the ‘black box’ nature of the models. This study introduces a framework designed to enhance both performance and transparency in this context. We use a notable performance disparity in the automated staging of mandibular second (tooth 37) and third (tooth 38) molars as a case study. The proposed framework, which combines a convolutional autoencoder (AE) with a Vision Transformer (ViT), improves classification accuracy for both teeth over a baseline ViT, increasing from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond improving performance, the framework provides multi-faceted diagnostic insights. Analysis of the AE’s latent space metrics and image reconstructions indicates that the remaining performance gap is data-centric, suggesting high intra-class morphological variability in the tooth 38 dataset is a primary limiting factor. This work highlights the insufficiency of relying on a single mode of interpretability, such as attention maps, which can appear anatomically plausible yet fail to identify underlying data issues. By offering a methodology that both enhances accuracy and provides evidence for why a model may be uncertain, this framework serves as a more robust tool to support expert decision-making in forensic age estimation.
[95] SCoDA: Self-supervised Continual Domain Adaptation
Chirayu Agrawal, Snehasis Mukherjee
Main category: cs.CV
TL;DR: SCoDA introduces self-supervised learning and geometric manifold alignment for source-free domain adaptation, outperforming existing methods by preserving crucial geometric information and avoiding supervised pre-training.
Details
Motivation: Existing SFDA methods discard important geometric information about the latent manifold by relying on cosine similarity over L2-normalized features, and they depend on supervised pre-training which limits their applicability.
Method: Initializes with self-supervised pre-trained teacher model, uses geometric manifold alignment with Space Similarity Loss, and employs EMA updates to prevent catastrophic forgetting while combining instance-level feature matching.
Result: Extensive experiments show SCoDA significantly outperforms state-of-the-art SFDA methods on benchmark datasets.
Conclusion: SCoDA successfully addresses limitations of previous SFDA approaches by leveraging self-supervised learning and geometric manifold alignment, providing a more effective framework for source-free domain adaptation.
Abstract: Source-Free Domain Adaptation (SFDA) addresses the challenge of adapting a model to a target domain without access to the data of the source domain. Prevailing methods typically start with a source model pre-trained with full supervision and distill the knowledge by aligning instance-level features. However, these approaches, relying on cosine similarity over L2-normalized feature vectors, inadvertently discard crucial geometric information about the latent manifold of the source model. We introduce Self-supervised Continual Domain Adaptation (SCoDA) to address these limitations. We make two key departures from standard practice: first, we avoid the reliance on supervised pre-training by initializing the proposed framework with a teacher model pre-trained entirely via self-supervision (SSL). Second, we adapt the principle of geometric manifold alignment to the SFDA setting. The student is trained with a composite objective combining instance-level feature matching with a Space Similarity Loss. To combat catastrophic forgetting, the teacher’s parameters are updated via an Exponential Moving Average (EMA) of the student’s parameters. Extensive experiments on benchmark datasets demonstrate that SCoDA significantly outperforms state-of-the-art SFDA methods.
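The EMA teacher update used against catastrophic forgetting is a one-liner per parameter; a minimal sketch follows, where the momentum value is a placeholder rather than the paper's setting.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```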
[96] Segment Anything for Cell Tracking
Zhu Chen, Mert Edgü, Er Jin, Johannes Stegmaier
Main category: cs.CV
TL;DR: Zero-shot cell tracking framework using SAM2 foundation model for unsupervised microscopy video analysis without training data.
Details
Motivation: Overcome limitations of manual labeling costs, poor generalizability, and dataset-specific biases in existing deep learning methods for cell tracking.
Method: Integrate Segment Anything 2 (SAM2) foundation model into tracking pipeline as a fully-unsupervised approach without dataset-specific training.
Result: Achieves competitive accuracy in both 2D and large-scale 3D time-lapse microscopy videos without dataset-specific adaptation.
Conclusion: Proposed framework eliminates the need for manual labeling and dataset-specific fine-tuning while maintaining competitive performance across diverse microscopy datasets.
Abstract: Tracking cells and detecting mitotic events in time-lapse microscopy image sequences is a crucial task in biomedical research. However, it remains highly challenging due to dividing objects, low signal-to-noise ratios, indistinct boundaries, dense clusters, and the visually similar appearance of individual cells. Existing deep learning-based methods rely on manually labeled datasets for training, which is both costly and time-consuming. Moreover, their generalizability to unseen datasets remains limited due to the vast diversity of microscopy data. To overcome these limitations, we propose a zero-shot cell tracking framework by integrating Segment Anything 2 (SAM2), a large foundation model designed for general image and video segmentation, into the tracking pipeline. As a fully-unsupervised approach, our method does not depend on or inherit biases from any specific training dataset, allowing it to generalize across diverse microscopy datasets without fine-tuning. Our approach achieves competitive accuracy in both 2D and large-scale 3D time-lapse microscopy videos while eliminating the need for dataset-specific adaptation.
[97] Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation
Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh, Hai Tran
Main category: cs.CV
TL;DR: A method to extend 2D multi-camera tracking systems into 3D space using depth information for point-cloud reconstruction and 3D box recovery, achieving 3rd place in AI City Challenge 2025.
Details
Motivation: Existing MTMC systems are built for 2D tracking and replacing all components for 3D tracking is infeasible, so there's a need to extend existing 2D systems rather than rebuild from scratch.
Method: Utilizes depth information to reconstruct targets in point-cloud space, performs clustering and yaw refinement for 3D box recovery, and introduces enhanced online data association using local ID consistency for global ID assignment.
Result: Achieved 3rd place on the leaderboard of the 2025 AI City Challenge’s 3D MTMC dataset.
Conclusion: The approach successfully extends existing 2D multi-camera tracking systems into 3D space without requiring complete system replacement, demonstrating practical viability for large-scale surveillance automation.
Abstract: Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduce an enhanced online data association mechanism that leverages the target’s local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge’s 3D MTMC dataset, achieving 3rd place on the leaderboard.
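The depth-based lift from 2D tracks to 3D can be approximated with a pinhole unprojection followed by clustering to isolate the target's points; the paper additionally refines yaw, which this toy sketch omits. Parameter names and clustering settings are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def bbox_to_points(depth, bbox, fx, fy, cx, cy):
    """Unproject pixels inside a 2D box into camera-space 3D points (pinhole model)."""
    x0, y0, x1, y1 = bbox
    ys, xs = np.mgrid[y0:y1, x0:x1]
    z = depth[y0:y1, x0:x1].ravel()
    x = (xs.ravel() - cx) * z / fx
    y = (ys.ravel() - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    return pts[z > 0]  # drop invalid depth

def keep_target_cluster(points, eps=0.5, min_samples=20):
    """Keep the largest DBSCAN cluster to discard background points in the box."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return points
    keep = np.bincount(valid).argmax()
    return points[labels == keep]
```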
[98] Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification
Jeffrey Liu, Rongbin Hu
Main category: cs.CV
TL;DR: A zero-shot visual-language verification approach for Referring Expression Comprehension that uses a general-purpose VLM to verify object proposals from YOLO-World, achieving state-of-the-art performance without REC-specific training.
Details
Motivation: To demonstrate that strong Referring Expression Comprehension performance can be achieved through workflow design rather than task-specific pretraining, reducing cross-box interference and supporting additional capabilities like abstention and multiple matches.
Method: Reformulates REC as box-wise visual-language verification: uses COCO-clean YOLO-World detector for proposals, then a general-purpose VLM independently answers True/False queries for each region without any fine-tuning.
Result: Surpasses zero-shot GroundingDINO baseline and exceeds reported results for trained GroundingDINO and GroundingDINO+CRG on RefCOCO, RefCOCO+, and RefCOCOg datasets. Verification significantly outperforms selection-based prompting in controlled studies.
Conclusion: Workflow design, rather than task-specific pretraining, is the key driver for strong zero-shot REC performance, with the verification approach providing competitive or superior results without REC-specific training.
Abstract: Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
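The workflow is easy to express in code: detector proposals in, one independent True/False query per box. In the sketch below, `vlm_answer` is a hypothetical stand-in for whatever VLM interface is used, not a specific model API, and the prompt wording is an assumption.

```python
def verify_boxes(image, expression, proposals, vlm_answer):
    """Box-wise True/False verification sketch.

    image: a PIL-style image with a .crop((x0, y0, x1, y1)) method
    proposals: candidate boxes, e.g. from YOLO-World
    vlm_answer: callable (crop, prompt) -> str, a hypothetical VLM interface
    """
    matches = []
    for box in proposals:
        crop = image.crop(box)
        prompt = f"Does this region show: '{expression}'? Answer True or False."
        if vlm_answer(crop, prompt).strip().lower().startswith("true"):
            matches.append(box)
    return matches  # empty list = abstention; more than one = multiple matches
```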
[99] Augment to Segment: Tackling Pixel-Level Imbalance in Wheat Disease and Pest Segmentation
Tianqi Wei, Xin Yu, Zhi Chen, Scott Chapman, Zi Huang
Main category: cs.CV
TL;DR: Proposes Random Projected Copy-and-Paste (RPCP) augmentation to address extreme pixel imbalance in wheat disease segmentation, particularly for rare insect damage classes.
Details
Motivation: Extreme pixel-level imbalance in wheat disease segmentation causes overfitting to common classes and insufficient learning of rare insect damage classes, impairing overall performance.
Method: Extracts rare insect-damage patches, applies random geometric transformations, pastes them in appropriate regions avoiding overlaps, and uses random projection filter to refine features and ensure natural blending.
Result: Substantially improves segmentation performance on insect damage class while maintaining or slightly enhancing accuracy on other categories.
Conclusion: Targeted augmentation effectively mitigates extreme pixel imbalance, offering a straightforward yet effective solution for agricultural segmentation problems.
Abstract: Accurate segmentation of foliar diseases and insect damage in wheat is crucial for effective crop management and disease control. However, the insect damage typically occupies only a tiny fraction of annotated pixels. This extreme pixel-level imbalance poses a significant challenge to the segmentation performance, which can result in overfitting to common classes and insufficient learning of rare classes, thereby impairing overall performance. In this paper, we propose a Random Projected Copy-and-Paste (RPCP) augmentation technique to address the pixel imbalance problem. Specifically, we extract rare insect-damage patches from annotated training images and apply random geometric transformations to simulate variations. The transformed patches are then pasted in appropriate regions while avoiding overlaps with lesions or existing damaged regions. In addition, we apply a random projection filter to the pasted regions, refining local features and ensuring a natural blend with the new background. Experiments show that our method substantially improves segmentation performance on the insect damage class, while maintaining or even slightly enhancing accuracy on other categories. Our results highlight the effectiveness of targeted augmentation in mitigating extreme pixel imbalance, offering a straightforward yet effective solution for agricultural segmentation problems.
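A minimal version of the paste step might look like the following NumPy sketch; the placement retry count and the random channel-mixing filter are simplified stand-ins for the paper's random projection filter, and all names are illustrative.

```python
import numpy as np

def paste_patch(image, mask, patch, patch_mask, rare_id, rng):
    """Paste a rare-class patch at a free location and lightly filter it.

    image: (H, W, 3) float image in [0, 1]; mask: (H, W) int labels (0 = background)
    patch: (h, w, 3) rare-class crop; patch_mask: (h, w) bool footprint
    rng: np.random.Generator
    """
    h, w = patch.shape[:2]
    H, W = image.shape[:2]
    for _ in range(10):  # try a few placements
        y, x = rng.integers(0, H - h), rng.integers(0, W - w)
        region = mask[y:y + h, x:x + w]
        if (region[patch_mask] == 0).all():       # avoid lesions / existing damage
            proj = rng.normal(0, 0.05, (3, 3))    # random channel mixing
            blended = np.clip(patch + patch @ proj, 0, 1)
            image[y:y + h, x:x + w][patch_mask] = blended[patch_mask]
            mask[y:y + h, x:x + w][patch_mask] = rare_id
            break
    return image, mask
```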
[100] An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identification: use case on long-term tracking in livestock
Anne Marthe Sophie Ngo Bibinbe, Chiron Bang, Patrick Gagnon, Jamie Ahloy-Dallaire, Eric R. Paquet
Main category: cs.CV
TL;DR: A new HMM framework that combines uncertain identities from sporadic sources (like feeders) with tracking to improve long-term multi-object tracking performance, validated on pig tracking and MOT benchmarks.
Details
Motivation: Existing MOT approaches suffer from identity switches over time, making them unsuitable for long-term tracking applications like livestock monitoring where sporadic identifications are available.
Method: Proposes a Hidden Markov Model framework that integrates uncertain identities from external sources (e.g., feeders) with tracking data to maintain consistent long-term object identities.
Result: Improves F1 score of ByteTrack on a 10-minute pig tracking dataset with 21 identifications, shows robustness to identification uncertainty, and validates performance on MOT17 and MOT20 benchmarks with both ByteTrack and FairMOT.
Conclusion: The HMM framework effectively leverages sporadic identifications to enhance long-term tracking performance and is robust to identification uncertainty, making it suitable for real-world applications like livestock monitoring.
Abstract: The need for long-term multi-object tracking (MOT) is growing due to the demand for analyzing individual behaviors in videos that span several minutes. Unfortunately, due to identity switches between objects, the tracking performance of existing MOT approaches decreases over time, making them difficult to apply for long-term tracking. However, in many real-world applications, such as in the livestock sector, it is possible to obtain sporadic identifications for some of the animals from sources like feeders. To address the challenges of long-term MOT, we propose a new framework that combines both uncertain identities and tracking using a Hidden Markov Model (HMM) formulation. In addition to providing real-world identities to animals, our HMM framework improves the F1 score of ByteTrack, a leading MOT approach, even when re-identification is used, on a 10-minute pig tracking dataset with 21 identifications at the pen’s feeding station. We also show that our approach is robust to the uncertainty of identifications, with performance increasing as identities are provided more frequently. The improved performance of our HMM framework was also validated on the MOT17 and MOT20 benchmark datasets using both ByteTrack and FairMOT. The code for this new HMM framework and the new 10-minute pig tracking video dataset are available at: https://github.com/ngobibibnbe/uncertain-identity-aware-tracking
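The HMM idea can be illustrated with a toy Viterbi decode in which hidden states are animal identities and the observation log-likelihoods encode the sporadic, uncertain identifications (uniform wherever no identification fires). This is an illustration of the formulation, not the released implementation.

```python
import numpy as np

def viterbi_identity(obs_loglik, trans_logprob):
    """Most likely identity sequence for one tracklet.

    obs_loglik: (T, K) log-likelihood of each of K identities per frame,
                peaked at frames with a feeder identification, flat elsewhere
    trans_logprob: (K, K) identity-transition log-probabilities
    """
    T, K = obs_loglik.shape
    dp = obs_loglik[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + trans_logprob  # (prev, next)
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + obs_loglik[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]  # identity per frame
```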
[101] Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey
Aupendu Kar, Vishnu Raj, Guan-Ming Su
Main category: cs.CV
TL;DR: Survey paper on event camera fusion with traditional frame-based capture for video restoration and 3D reconstruction tasks, covering deep learning approaches for temporal/spatial enhancement and compiling available datasets.
Details
Motivation: Event cameras offer low latency, low power consumption, and high capture rates, but their fusion with traditional frame-based systems can significantly benefit video restoration and 3D reconstruction tasks that require handling challenging visual conditions.
Method: Systematic review of deep learning contributions for image/video enhancement, focusing on temporal enhancement (frame interpolation, motion deblurring) and spatial enhancement (super-resolution, low-light/HDR enhancement, artifact reduction), plus 3D reconstruction evolution with event-driven fusion.
Result: Comprehensive survey covering diverse topics with in-depth discussions on recent works, compilation of openly available datasets for reproducible research, and insights into how event camera fusion improves visual quality under challenging conditions.
Conclusion: The survey consolidates recent progress to inspire further research into leveraging event camera systems combined with deep learning for advanced visual media restoration and enhancement applications.
Abstract: Event camera sensors are bio-inspired sensors which asynchronously capture per-pixel brightness changes and output a stream of events encoding the polarity, location and time of these changes. These systems are witnessing rapid advancements as an emerging field, driven by their low latency, reduced power consumption, and ultra-high capture rates. This survey explores the evolution of fusing event-stream captured with traditional frame-based capture, highlighting how this synergy significantly benefits various video restoration and 3D reconstruction tasks. The paper systematically reviews major deep learning contributions to image/video enhancement and restoration, focusing on two dimensions: temporal enhancement (such as frame interpolation and motion deblurring) and spatial enhancement (including super-resolution, low-light and HDR enhancement, and artifact reduction). This paper also explores how the 3D reconstruction domain evolves with the advancement of event driven fusion. Diverse topics are covered, with in-depth discussions on recent works for improving visual quality under challenging conditions. Additionally, the survey compiles a comprehensive list of openly available datasets, enabling reproducible research and benchmarking. By consolidating recent progress and insights, this survey aims to inspire further research into leveraging event camera systems, especially in combination with deep learning, for advanced visual media restoration and enhancement.
[102] ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking
Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang, Qingkai Yang, Jibin Wu, Hao Guo, Lei Deng
Main category: cs.CV
TL;DR: ISTASTrack is a transformer-based ANN-SNN hybrid tracker with ISTA adapters for RGB-Event tracking, achieving state-of-the-art performance while maintaining energy efficiency.
Details
Motivation: Existing ANNs struggle to exploit the sparse and asynchronous nature of event streams in RGB-Event tracking, and effectively fusing features across heterogeneous ANN-SNN paradigms remains challenging.
Method: Two-branch model: vision transformer for RGB spatial context and spiking transformer for event stream spatio-temporal dynamics. Uses model-based ISTA adapters for bidirectional feature interaction and temporal downsampling attention for feature alignment.
Result: Achieves state-of-the-art performance on RGB-Event tracking benchmarks (FE240hz, VisEvent, COESOT, FELT) while maintaining high energy efficiency.
Conclusion: ISTASTrack demonstrates the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking, successfully bridging modality and paradigm gaps between RGB and event data.
Abstract: RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.
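Unfolding ISTA into a fixed number of steps is the core mechanism behind such adapters. A minimal sketch of the iteration being unrolled is below, solving min_z ||x - zAᵀ||² + λ||z||₁; in an actual learned adapter the dictionary, threshold, and step size would be trainable parameters, and the sketch is not the ISTASTrack module itself.

```python
import torch

def soft_threshold(x, lam):
    return torch.sign(x) * torch.clamp(torch.abs(x) - lam, min=0.0)

def ista_unfold(x, A, lam=0.1, step=0.5, iters=3):
    # x: (B, D) input features; A: (D, K) dictionary mapping sparse codes to features
    z = torch.zeros(x.shape[0], A.shape[1], device=x.device)
    for _ in range(iters):
        grad = (z @ A.T - x) @ A                   # gradient of the data-fit term
        z = soft_threshold(z - step * grad, step * lam)
    return z  # sparse code after a fixed number of unrolled steps

z = ista_unfold(torch.randn(4, 64), torch.randn(64, 128))
```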
[103] FLARE-SSM: Deep State Space Models with Influence-Balanced Loss for 72-Hour Solar Flare Prediction
Yusuke Takagi, Shunya Nagashima, Komei Sugiura
Main category: cs.CV
TL;DR: A solar flare prediction model using multiple deep state space models with FLARE loss to handle class imbalance, achieving better performance than baselines on standard metrics.
Details
Motivation: Accurate solar flare prediction is crucial for infrastructure protection, but current methods perform poorly due to severe class imbalance across flare classes.
Method: Proposed solar flare prediction model based on multiple deep state space models with frequency & local-boundary-aware reliability loss (FLARE loss) to address class imbalance.
Result: Outperformed baseline approaches in both Gandin-Murphy-Gerrity score and true skill statistic on multi-wavelength solar image dataset covering 11-year solar cycle.
Conclusion: The proposed method with FLARE loss effectively handles class imbalance and improves predictive performance and reliability for solar flare forecasting.
Abstract: Accurate and reliable solar flare predictions are essential to mitigate potential impacts on critical infrastructure. However, the current performance of solar flare forecasting is insufficient. In this study, we address the task of predicting the class of the largest solar flare expected to occur within the next 72 hours. Existing methods often fail to adequately address the severe class imbalance across flare classes. To address this issue, we propose a solar flare prediction model based on multiple deep state space models. In addition, we introduce the frequency & local-boundary-aware reliability loss (FLARE loss) to improve predictive performance and reliability under class imbalance. Experiments were conducted on a multi-wavelength solar image dataset covering a full 11-year solar activity cycle. As a result, our method outperformed baseline approaches in terms of both the Gandin-Murphy-Gerrity score and the true skill statistic, which are standard metrics for performance and reliability.
[104] TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion
Xiaodong Guo, Tong Liu, Yike Li, Zi’ang Lin, Zhihong Deng
Main category: cs.CV
TL;DR: TUNI is a novel RGB-thermal semantic segmentation model that uses a unified encoder for simultaneous feature extraction and cross-modal fusion, achieving competitive performance with fewer parameters and real-time inference speed.
Details
Motivation: Existing RGB-T models use separate encoders pre-trained on RGB images, leading to limited thermal feature extraction, suboptimal cross-modal fusion, and redundant architecture that compromises real-time efficiency.
Method: Proposes TUNI with a unified RGB-T encoder that simultaneously performs multi-modal feature extraction and fusion, uses large-scale pre-training with RGB and pseudo-thermal data, includes a slimmed thermal branch, and employs an RGB-T local module with adaptive cosine similarity for selective feature emphasis.
Result: Achieves competitive performance with state-of-the-art models on FMB, PST900 and CART datasets, with fewer parameters and lower computational cost. Achieves 27 FPS inference speed on Jetson Orin NX.
Conclusion: TUNI demonstrates that unified feature extraction and fusion in a single encoder architecture can achieve both high performance and real-time efficiency for RGB-T semantic segmentation tasks.
Abstract: RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model’s real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder’s capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.
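The cosine-similarity gating of the local module can be approximated in a few lines: where the modalities agree, emphasize their shared content; where they differ, keep the distinct residual. The exact weighting scheme below is an assumption, not the TUNI module.

```python
import torch
import torch.nn.functional as F

def cosine_gated_fusion(rgb_feat, thermal_feat, alpha=1.0):
    # rgb_feat, thermal_feat: (B, C, H, W) local feature maps
    sim = F.cosine_similarity(rgb_feat, thermal_feat, dim=1, eps=1e-6)  # (B, H, W)
    w = torch.sigmoid(alpha * sim).unsqueeze(1)       # per-pixel agreement weight
    consistent = w * (rgb_feat + thermal_feat) / 2    # emphasized where modalities agree
    distinct = (1 - w) * (rgb_feat - thermal_feat)    # kept where they disagree
    return consistent + distinct
```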
[105] Few-Part-Shot Font Generation
Masaki Akiba, Shumpei Takezaki, Daichi Haraguchi, Seiichi Uchida
Main category: cs.CV
TL;DR: A novel few-part-shot font generation model that creates complete fonts using only partial character shapes as input, improving efficiency and providing insights into how design details influence character structure.
Details
Motivation: Traditional few-shot font generation requires complete character shapes for multiple character classes, which can be inefficient. This approach aims to streamline font creation by using only partial design elements.
Method: The paper proposes a model that designs entire fonts based on partial shapes (design elements) rather than complete characters. It focuses on how partial design details can generate complete character structures.
Result: The model successfully generates complete fonts using only partial input shapes, demonstrating improved efficiency in font creation while maintaining quality.
Conclusion: This approach represents a significant advancement in font generation by reducing input requirements while providing valuable insights into the relationship between partial design elements and complete character structures.
Abstract: This paper proposes a novel model of few-part-shot font generation, which designs an entire font based on a set of partial design elements, i.e., partial shapes. Unlike conventional few-shot font generation, which requires entire character shapes for a couple of character classes, our approach only needs partial shapes as input. The proposed model not only improves the efficiency of font creation but also provides insights into how partial design details influence the entire structure of the individual characters.
[106] Hierarchical MLANet: Multi-level Attention for 3D Face Reconstruction From Single Images
Danling Cao
Main category: cs.CV
TL;DR: MLANet: Hierarchical Multi-Level Attention Network for 3D face reconstruction from single in-the-wild images using CNN with attention mechanisms and semi-supervised training.
Details
Motivation: Lack of ground-truth labeled datasets and complexity of real-world environments pose challenges for 3D face model recovery from 2D images.
Method: Uses pre-trained hierarchical backbone network with multi-level attention mechanisms, 3DMM parameters, and differentiable renderer for end-to-end training.
Result: Extensive experiments on AFLW2000-3D and MICC Florence datasets show effectiveness in 3D face reconstruction and alignment tasks.
Conclusion: Proposed MLANet successfully reconstructs detailed facial geometry, texture, pose, and illumination from single in-the-wild images.
Abstract: Recovering 3D face models from 2D in-the-wild images has gained considerable attention in the computer vision community due to its wide range of potential applications. However, the lack of ground-truth labeled datasets and the complexity of real-world environments remain significant challenges. In this chapter, we propose a convolutional neural network-based approach, the Hierarchical Multi-Level Attention Network (MLANet), for reconstructing 3D face models from single in-the-wild images. Our model predicts detailed facial geometry, texture, pose, and illumination parameters from a single image. Specifically, we employ a pre-trained hierarchical backbone network and introduce multi-level attention mechanisms at different stages of 2D face image feature extraction. A semi-supervised training strategy is employed, incorporating 3D Morphable Model (3DMM) parameters from publicly available datasets along with a differentiable renderer, enabling an end-to-end training process. Extensive experiments, including both comparative and ablation studies, were conducted on two benchmark datasets, AFLW2000-3D and MICC Florence, focusing on 3D face reconstruction and 3D face alignment tasks. The effectiveness of the proposed method was evaluated both quantitatively and qualitatively.
[107] LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Jianshu Li
Main category: cs.CV
TL;DR: LaV-CoT is a novel multilingual visual question answering framework that combines visual chain-of-thought reasoning with multi-aspect reward optimization, achieving state-of-the-art performance on multiple benchmarks and outperforming larger models.
Details
Motivation: Existing multilingual visual question answering approaches rely primarily on textual chain-of-thought reasoning and provide limited support for multilingual multimodal reasoning, constraining real-world deployment.
Method: LaV-CoT uses a multi-stage reasoning pipeline with text summary, language identification, spatial object captioning, and logical reasoning. It employs automated data curation and two-stage training with supervised fine-tuning plus Language-aware Group Relative Policy Optimization guided by multi-aspect rewards.
Result: Achieves up to 9.5% accuracy improvement over similar-sized open-source baselines, surpasses models with 2x larger scales by 2.6%, and outperforms proprietary models like GPT-4o-0513 and Gemini-2.5-flash. Validated through online A/B testing.
Conclusion: LaV-CoT provides an effective framework for multilingual multimodal reasoning that enhances interpretability and performance, making it suitable for industrial deployment in real-world applications.
Abstract: As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2× larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at: https://github.com/HJNVR/LaV-CoT
[108] Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation
Sung-Lin Tsai, Bo-Lun Huang, Yu Ting Shen, Cheng Yu Yeo, Chiang Tseng, Bo-Kai Ruan, Wen-Sheng Lien, Hong-Han Shuai
Main category: cs.CV
TL;DR: A training-free framework that uses LLMs to disambiguate color terms and refines text embeddings in CIELAB space to improve color accuracy in text-to-image generation without compromising image quality.
Details
Motivation: Current diffusion models struggle with nuanced and compound color terms, producing images misaligned with human intent. Existing approaches fail to systematically resolve ambiguous color descriptions.
Method: Uses LLM to disambiguate color-related prompts, then refines text embeddings based on spatial relationships of color terms in CIELAB color space. Training-free approach without reference images.
Result: Experimental results show improved color alignment without compromising image quality, bridging the gap between text semantics and visual generation.
Conclusion: The proposed framework effectively enhances color fidelity in T2I generation by leveraging LLM disambiguation and CIELAB space guidance, providing a systematic solution for ambiguous color descriptions.
Abstract: Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, lime green, hot pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. Our method first employs the LLM to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation.
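One concrete way to realize distance-based guidance in CIELAB is to weight anchor color terms by inverse Lab distance to the disambiguated target color. The sketch below shows only this weighting step, using a plain Euclidean ΔE for simplicity; the actual framework refines text embeddings, which is not reproduced here, and the anchor choices in the usage line are illustrative.

```python
import numpy as np
from skimage.color import rgb2lab

def lab(rgb):
    """sRGB triplet in [0, 1] -> CIELAB coordinates."""
    return rgb2lab(np.asarray(rgb, dtype=float).reshape(1, 1, 3)).reshape(3)

def blend_weights(target_rgb, anchor_rgbs):
    """Weight anchor color terms by inverse CIELAB distance to the target color."""
    t = lab(target_rgb)
    d = np.array([np.linalg.norm(t - lab(a)) for a in anchor_rgbs])
    w = 1.0 / (d + 1e-6)
    return w / w.sum()

# e.g. Tiffany blue (~#0ABAB5) weighted against plain 'blue' and 'green' anchors
w = blend_weights((0.04, 0.73, 0.71), [(0.0, 0.0, 1.0), (0.0, 0.5, 0.0)])
```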
[109] Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li, Yiping Ke, Xue Jiang, Wayne Zhang
Main category: cs.CV
TL;DR: AVI-Math is the first benchmark for evaluating multimodal mathematical reasoning in aerial vehicle imagery, covering geometry, logic, and algebra with 3,773 UAV-captured questions. Current VLMs struggle significantly with these reasoning tasks despite success on other benchmarks.
Details
Motivation: Current vision-language models have not been adequately tested for mathematical reasoning in UAV-based remote sensing applications, which require precise computations for distance, area, trajectory estimations, and spatial analysis.
Method: Created AVI-Math benchmark with 3,773 high-quality vehicle-related questions from UAV views, covering 6 mathematical subjects and 20 topics. Data collected at varying altitudes and multiple angles. Evaluated 14 prominent VLMs and explored Chain-of-Thought prompting and fine-tuning techniques.
Result: Current VLMs struggle significantly with mathematical reasoning tasks in AVI-Math despite their success on previous multimodal benchmarks. The benchmark reveals significant limitations in mathematical reasoning capabilities.
Conclusion: The findings expose limitations of VLMs in mathematical reasoning and offer insights for advancing UAV-based trustworthy VLMs. Chain-of-Thought prompting and fine-tuning show promise for addressing these reasoning challenges.
Abstract: Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math
[110] BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird’s-Eye View with Deformable Attention and Sparse Goal Proposals
Minsang Kong, Myeongjun Kim, Sang Gu Kang, Sang Hun Lee
Main category: cs.CV
TL;DR: BEVTraj is a novel trajectory prediction framework that uses real-time sensor data in bird’s-eye view space without relying on pre-built HD maps, achieving comparable performance to map-based models with greater flexibility.
Details
Motivation: Overcome limitations of pre-built HD maps (limited to specific regions, cannot adapt to transient changes) and local map construction modules (may fail to capture critical details or introduce errors) in autonomous driving trajectory prediction.
Method: Operates directly in BEV space using real-time sensor data, leverages deformable attention to extract context from dense BEV features, and introduces Sparse Goal Candidate Proposal (SGCP) module for end-to-end prediction without post-processing.
Result: Extensive experiments show BEVTraj achieves performance comparable to state-of-the-art HD map-based models while offering greater flexibility by eliminating dependency on pre-built maps.
Conclusion: BEVTraj provides an effective alternative to map-dependent approaches, enabling accurate trajectory prediction with real-time sensor data only, making it more adaptable to various driving environments.
Abstract: In autonomous driving, trajectory prediction is essential for ensuring safe and efficient navigation. To improve prediction accuracy, recent approaches often rely on pre-built high-definition (HD) maps or real-time local map construction modules to incorporate static environmental information. However, pre-built HD maps are limited to specific regions and cannot adapt to transient changes. In addition, local map construction modules, which recognize only predefined elements, may fail to capture critical scene details or introduce errors that degrade prediction performance. To overcome these limitations, we propose Bird’s-Eye View Trajectory Prediction (BEVTraj), a novel trajectory prediction framework that operates directly in the bird’s-eye view (BEV) space utilizing real-time sensor data without relying on any pre-built maps. The BEVTraj leverages deformable attention to efficiently extract relevant context from dense BEV features. Furthermore, we introduce a Sparse Goal Candidate Proposal (SGCP) module, which enables full end-to-end prediction without requiring any post-processing steps. Extensive experiments demonstrate that the BEVTraj achieves performance comparable to state-of-the-art HD map-based models while offering greater flexibility by eliminating the dependency on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.
[111] Leveraging Multi-View Weak Supervision for Occlusion-Aware Multi-Human Parsing
Laura Bragagnolo, Matteo Terreran, Leonardo Barcellona, Stefano Ghidoni
Main category: cs.CV
TL;DR: A novel training framework using multi-view information to improve multi-human parsing performance in occlusion scenarios, achieving 4.20% relative improvement over baseline models.
Details
Motivation: State-of-the-art multi-human parsing approaches struggle with overlapping/occluded bodies, but overlapping people appear separated from different viewpoints, suggesting multi-view information could help.
Method: Proposes a training framework with weak supervision on human instances and multi-view consistency loss, using semi-automatic annotation strategy to generate instance segmentation masks from multi-view RGB+D data and 3D human skeletons.
Result: The approach achieves up to 4.20% relative improvement on human parsing over baseline models in occlusion scenarios.
Conclusion: Multi-view information integration through novel training framework effectively improves multi-human parsing performance under occlusions.
Abstract: Multi-human parsing is the task of segmenting human body parts while associating each part to the person it belongs to, combining instance-level and part-level information for fine-grained human understanding. In this work, we demonstrate that, while state-of-the-art approaches achieved notable results on public datasets, they struggle considerably in segmenting people with overlapping bodies. From the intuition that overlapping people may appear separated from a different point of view, we propose a novel training framework exploiting multi-view information to improve multi-human parsing models under occlusions. Our method integrates such knowledge during the training process, introducing a novel approach based on weak supervision on human instances and a multi-view consistency loss. Given the lack of suitable datasets in the literature, we propose a semi-automatic annotation strategy to generate human instance segmentation masks from multi-view RGB+D data and 3D human skeletons. The experiments demonstrate that the approach can achieve up to a 4.20% relative improvement on human parsing over the baseline model in occlusion scenarios.
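To make the multi-view consistency idea concrete, here is a minimal PyTorch sketch of such a loss, assuming the per-view part-segmentation logits have already been warped into a shared reference frame (warping not shown); the function name, the symmetric-KL choice, and the visibility mask are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multiview_consistency_loss(logits_a, logits_b, valid_mask):
    """Penalize disagreement between part-segmentation predictions of the
    same people seen from two views, after both views have been warped
    into a shared reference frame.

    logits_a, logits_b: (B, C, H, W) per-view class logits.
    valid_mask: (B, 1, H, W) pixels visible in both views.
    """
    # Symmetric KL so that neither view acts as a fixed "teacher".
    kl_ab = F.kl_div(F.log_softmax(logits_a, dim=1),
                     F.softmax(logits_b, dim=1),
                     reduction="none").sum(dim=1, keepdim=True)
    kl_ba = F.kl_div(F.log_softmax(logits_b, dim=1),
                     F.softmax(logits_a, dim=1),
                     reduction="none").sum(dim=1, keepdim=True)
    loss = 0.5 * (kl_ab + kl_ba) * valid_mask
    return loss.sum() / valid_mask.sum().clamp(min=1.0)
```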
[112] VARCO-VISION-2.0 Technical Report
Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim
Main category: cs.CV
TL;DR: VARCO-VISION-2.0 is an open-weight bilingual vision-language model for Korean and English that improves upon previous versions with multi-image understanding, layout-aware OCR, and enhanced multimodal alignment through four-stage curriculum training.
Details
Motivation: To advance bilingual vision-language models for Korean and English with improved capabilities in handling complex visual inputs like documents, charts, and tables while maintaining language abilities and safety.
Method: Four-stage curriculum training with memory-efficient techniques, supporting multi-image understanding and layout-aware OCR that predicts both text content and spatial location.
Result: Achieves strong spatial grounding and competitive bilingual performance, with the 14B model ranking 8th on OpenCompass VLM leaderboard among comparable models. Also releases a 1.7B version for on-device deployment.
Conclusion: VARCO-VISION-2.0 advances bilingual VLM development with practical applications, offering both full-scale and lightweight models for different deployment needs.
Abstract: We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum using memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.
[113] A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss
MohammadAli Hamidi, Hadi Amirpour, Luigi Atzori, Christian Timmerer
Main category: cs.CV
TL;DR: Lightweight face image quality assessment method using ensemble of MobileNetV3-Small and ShuffleNetV2 with correlation-aware loss, achieving high accuracy with low computational cost.
Details
Motivation: Existing FIQA methods are either not face-specific (general-purpose IQA) or computationally intensive, limiting practical deployment in real-world face recognition systems.
Method: Ensemble of two compact CNNs (MobileNetV3-Small and ShuffleNetV2) with prediction-level fusion via averaging, using MSECorrLoss that combines MSE with Pearson correlation regularizer for better human perception alignment.
Result: Achieved SRCC of 0.9829 and PLCC of 0.9894 on VQualA benchmark while meeting efficiency constraints.
Conclusion: Proposed method provides excellent balance between accuracy and computational efficiency, making it suitable for real-world face recognition applications.
Abstract: Face image quality assessment (FIQA) plays a critical role in face recognition and verification systems, especially in uncontrolled, real-world environments. Although several methods have been proposed, general-purpose no-reference image quality assessment techniques often fail to capture face-specific degradations. Meanwhile, state-of-the-art FIQA models tend to be computationally intensive, limiting their practical applicability. We propose a lightweight and efficient method for FIQA, designed for the perceptual evaluation of face images in the wild. Our approach integrates an ensemble of two compact convolutional neural networks, MobileNetV3-Small and ShuffleNetV2, with prediction-level fusion via simple averaging. To enhance alignment with human perceptual judgments, we employ a correlation-aware loss (MSECorrLoss), combining mean squared error (MSE) with a Pearson correlation regularizer. Our method achieves a strong balance between accuracy and computational cost, making it suitable for real-world deployment. Experiments on the VQualA FIQA benchmark demonstrate that our model achieves a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, remaining within competition efficiency constraints.
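The abstract specifies MSECorrLoss as MSE plus a Pearson correlation regularizer. A minimal PyTorch sketch follows; the weighting `lam` and the epsilon are assumptions, since the paper's exact hyperparameters are not given here.

```python
import torch

def mse_corr_loss(pred, target, lam=0.5, eps=1e-8):
    """MSE plus a Pearson-correlation regularizer over a batch of
    quality scores. pred, target: (B,) tensors."""
    mse = torch.mean((pred - target) ** 2)
    pred_c = pred - pred.mean()
    tgt_c = target - target.mean()
    corr = (pred_c * tgt_c).sum() / (pred_c.norm() * tgt_c.norm() + eps)
    # Maximizing correlation is equivalent to minimizing (1 - corr).
    return mse + lam * (1.0 - corr)
```

Prediction-level fusion as described is then simply the average of the two backbone scores, e.g. `0.5 * (mobilenet_score + shufflenet_score)`.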
[114] Realism Control One-step Diffusion for Real-World Image Super-Resolution
Zongliang Wu, Siming Zheng, Peng-Tao Jiang, Xin Yuan
Main category: cs.CV
TL;DR: RCOD is a one-step diffusion framework for real-world image super-resolution that enables explicit control over fidelity-realism trade-offs through latent domain grouping and degradation-aware sampling.
Details
Motivation: One-step diffusion methods for super-resolution lack flexible control mechanisms to balance fidelity and realism across diverse scenarios, unlike multi-step methods that can adjust sampling steps.
Method: Proposes RCOD with latent domain grouping strategy for explicit fidelity-realism control, degradation-aware sampling alignment, and visual prompt injection using degradation-aware visual tokens instead of text prompts.
Result: Achieves superior fidelity and perceptual quality while maintaining computational efficiency, outperforming state-of-the-art OSD methods in both quantitative metrics and visual quality with flexible realism control.
Conclusion: RCOD provides an effective solution for real-world image super-resolution with controllable fidelity-realism trade-offs in one-step diffusion frameworks, demonstrating significant improvements over existing methods.
Abstract: Pre-trained diffusion models have shown great potential in real-world image super-resolution (Real-ISR) tasks by enabling high-resolution reconstructions. While one-step diffusion (OSD) methods significantly improve efficiency compared to traditional multi-step approaches, they still have limitations in balancing fidelity and realism across diverse scenarios. Since the OSDs for SR are usually trained or distilled by a single timestep, they lack flexible control mechanisms to adaptively prioritize these competing objectives, which are inherently manageable in multi-step methods through adjusting sampling steps. To address this challenge, we propose a Realism Controlled One-step Diffusion (RCOD) framework for Real-ISR. RCOD provides a latent domain grouping strategy that enables explicit control over fidelity-realism trade-offs during the noise prediction phase with minimal training paradigm modifications and original training data. A degradation-aware sampling strategy is also introduced to align distillation regularization with the grouping strategy and enhance control over the trade-offs. Moreover, a visual prompt injection module is used to replace conventional text prompts with degradation-aware visual tokens, enhancing both restoration accuracy and semantic consistency. Our method achieves superior fidelity and perceptual quality while maintaining computational efficiency. Extensive experiments demonstrate that RCOD outperforms state-of-the-art OSD methods in both quantitative metrics and visual quality, with flexible realism control capabilities in the inference stage. The code will be released.
[115] Grad-CL: Source Free Domain Adaptation with Gradient Guided Feature Disalignment
Rini Smita Thakur, Rajeev Ranjan Dwivedi, Vinod K Kurmi
Main category: cs.CV
TL;DR: Grad-CL is a source-free domain adaptation framework for optic disc and cup segmentation that uses gradient-guided pseudolabel refinement and cosine similarity contrastive learning to improve cross-domain performance without accessing source data.
Details
Motivation: Existing segmentation models suffer performance degradation when applied to target data from different imaging protocols or conditions, requiring robust adaptation without access to original source data.
Method: Two-stage approach: 1) Gradient-based mechanism extracts class-specific features for uncertainty quantification and prototype estimation to refine noisy pseudolabels, 2) Cosine similarity contrastive loss enforces inter-class separability between optic cup and disc features.
Result: Outperforms state-of-the-art unsupervised and source-free domain adaptation methods on challenging cross-domain fundus imaging datasets, achieving superior segmentation accuracy and improved boundary delineation.
Conclusion: Grad-CL provides an effective source-free domain adaptation solution for medical image segmentation that maintains performance across different imaging conditions without requiring access to original training data.
Abstract: Accurate segmentation of the optic disc and cup is critical for the early diagnosis and management of ocular diseases such as glaucoma. However, segmentation models trained on one dataset often suffer significant performance degradation when applied to target data acquired under different imaging protocols or conditions. To address this challenge, we propose Grad-CL, a novel source-free domain adaptation framework that leverages a pre-trained source model and unlabeled target data to robustly adapt segmentation performance without requiring access to the original source data. Grad-CL combines a gradient-guided pseudolabel refinement module with a cosine similarity-based contrastive learning strategy. In the first stage, salient class-specific features are extracted via a gradient-based mechanism, enabling more accurate uncertainty quantification and robust prototype estimation for refining noisy pseudolabels. In the second stage, a contrastive loss based on cosine similarity is employed to explicitly enforce inter-class separability between the gradient-informed features of the optic cup and disc. Extensive experiments on challenging cross-domain fundus imaging datasets demonstrate that Grad-CL outperforms state-of-the-art unsupervised and source-free domain adaptation methods, achieving superior segmentation accuracy and improved boundary delineation. Project and code are available at https://visdomlab.github.io/GCL/.
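As an illustration of the second stage, a cosine-similarity separation term between optic-cup and optic-disc features might look like the sketch below; the prototype-mean construction and function names are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def interclass_cosine_loss(cup_feats, disc_feats):
    """Push cup and disc feature prototypes apart in cosine space.
    cup_feats, disc_feats: (N, D) gradient-informed pixel features
    gathered from the respective pseudolabeled regions."""
    cup_proto = F.normalize(cup_feats.mean(dim=0), dim=0)
    disc_proto = F.normalize(disc_feats.mean(dim=0), dim=0)
    # Cosine similarity lies in [-1, 1]; minimizing it enforces
    # inter-class separability between the two structures.
    return F.cosine_similarity(cup_proto, disc_proto, dim=0)
```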
[116] Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization
Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, Xingang Wang
Main category: cs.CV
TL;DR: VQBridge is a novel projector that enables 100% codebook usage in vector quantization networks, achieving state-of-the-art reconstruction and significantly improving image generation performance when integrated with LlamaGen.
Details
Motivation: Vector quantization training suffers from instability due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, leading to suboptimal reconstruction and low codebook usage.
Method: Proposed VQBridge, a robust projector based on map function method that optimizes code vectors through a compress-process-recover pipeline, combined with learning annealing to achieve full codebook usage.
Result: Achieves 100% codebook usage even with 262k-codebook, state-of-the-art reconstruction performance, consistent improvement with larger codebooks/higher channels/longer training, and effective across different VQ variants. When integrated with LlamaGen, surpasses VAR by 0.5 and DiT by 0.2 rFID.
Conclusion: VQBridge enables stable and effective codebook training, demonstrating the importance of high-quality tokenizers for strong autoregressive image generation and providing a scalable solution for full codebook utilization.
Abstract: Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress-process-recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.
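For reference, the codebook-usage statistic that FVQ drives to 100% can be measured directly from the quantization indices; this small helper is a straightforward sketch, not code from the paper.

```python
import torch

def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries hit at least once by a batch of
    quantization indices. indices: LongTensor of arbitrary shape with
    values in [0, codebook_size)."""
    used = torch.unique(indices).numel()
    return used / codebook_size
```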
[117] LayerLock: Non-collapsing Representation Learning with Progressive Freezing
Goker Erdogan, Nikhil Parthasarathy, Catalin Ionescu, Drew Hudson, Alexander Lerchner, Andrew Zisserman, Mehdi Sajjadi, Joao Carreira
Main category: cs.CV
TL;DR: LayerLock accelerates masked autoencoding by progressively freezing ViT layers based on their convergence order, enabling efficient latent prediction without representation collapse.
Details
Motivation: The authors observed that ViT layers converge sequentially from shallow to deep during video MAE training, and sought to exploit this pattern to accelerate training and enable effective latent prediction.
Method: Progressive layer freezing schedule based on layer convergence order, applied to large masked autoencoding models (up to 4B parameters) for both pixel and latent prediction.
Result: Outperforms standard non-latent masked prediction on the 4DS perception suite, demonstrating scalability and effectiveness without representation collapse issues.
Conclusion: LayerLock provides a simple yet effective approach for self-supervised visual representation learning that leverages natural layer convergence patterns to accelerate training and enable robust latent prediction.
Abstract: We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning, that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule, throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from “representation collapse”. We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
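A minimal sketch of progressive freezing in PyTorch, assuming a linear shallow-to-deep schedule; the paper uses an explicit schedule whose exact form is not reproduced here, so the linear ramp is an assumption.

```python
def apply_freezing_schedule(vit_blocks, step, total_steps):
    """Freeze ViT blocks shallow-to-deep as training progresses,
    mirroring the observed shallow-first convergence order.
    vit_blocks: list of nn.Module transformer blocks, shallow first."""
    n_frozen = int(len(vit_blocks) * step / total_steps)
    for i, block in enumerate(vit_blocks):
        trainable = i >= n_frozen  # shallow blocks freeze first
        for p in block.parameters():
            p.requires_grad_(trainable)
```

Called once per training step (or per epoch), this gradually removes shallow layers from the backward pass, which is where the reported acceleration comes from.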
[118] On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints
Elias De Smijter, Renaud Detry, Christophe De Vleeschouwer
Main category: cs.CV
TL;DR: Appearance embeddings improve photometric fidelity but not geometric accuracy in space-based 3D object reconstruction. Convex splatting provides more compact representations than Gaussian splatting for safety-critical applications.
Details
Motivation: To evaluate the role of appearance embeddings in novel view synthesis methods for space-based 3D object reconstruction, particularly for space robotics applications where geometric accuracy is critical.
Method: Systematic comparison of implicit (K-Planes) and explicit (Gaussian Splatting, Convex Splatting) methods using the SPEED+ dataset, analyzing the impact of appearance embeddings on photometric fidelity and geometric accuracy.
Result: Embeddings improve photometric fidelity by modeling lighting variation but do not enhance geometric accuracy. They primarily reduce the number of primitives needed for explicit methods. Convex splatting achieves more compact and clutter-free representations than Gaussian splatting.
Conclusion: Appearance embeddings have limited value for geometry-centric tasks in space scenarios. Convex splatting offers advantages for safety-critical applications like interaction and collision avoidance due to its compact representation.
Abstract: We present the first systematic comparison of implicit and explicit Novel View Synthesis methods for space-based 3D object reconstruction, evaluating the role of appearance embeddings. While embeddings improve photometric fidelity by modeling lighting variation, we show they do not translate into meaningful gains in geometric accuracy - a critical requirement for space robotics applications. Using the SPEED+ dataset, we compare K-Planes, Gaussian Splatting, and Convex Splatting, and demonstrate that embeddings primarily reduce the number of primitives needed for explicit methods rather than enhancing geometric fidelity. Moreover, convex splatting achieves more compact and clutter-free representations than Gaussian splatting, offering advantages for safety-critical applications such as interaction and collision avoidance. Our findings clarify the limits of appearance embeddings for geometry-centric tasks and highlight trade-offs between reconstruction quality and representation efficiency in space scenarios.
[119] GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection
Haozhen Yan, Yan Hong, Suning Lang, Jiahui Zhan, Yikun Ji, Yujie Gao, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang
Main category: cs.CV
TL;DR: GAMMA is a novel training framework that improves AI-generated image detection by reducing domain bias and enhancing semantic alignment through diverse manipulation strategies and multi-task supervision.
Details
Motivation: Existing AI-generated image detectors struggle with generalization to unseen generative models due to reliance on generation-specific artifacts like stylistic priors and compression patterns.
Method: Proposes GAMMA framework with diverse manipulation strategies (inpainting-based manipulation, semantics-preserving perturbations), multi-task supervision with dual segmentation heads and classification head, and a reverse cross-attention mechanism for segmentation heads to guide classification.
Result: Achieves state-of-the-art generalization on GenImage benchmark with 5.8% accuracy improvement and maintains strong robustness on newly released models like GPT-4o.
Conclusion: GAMMA effectively addresses domain bias in AI-generated image detection and demonstrates superior generalization capabilities across diverse generative models.
Abstract: With generative models becoming increasingly sophisticated and diverse, detecting AI-generated images has become increasingly challenging. While existing AI-generated image detectors achieve promising performance on in-distribution generated images, their generalization to unseen generative models remains limited. This limitation is largely attributed to their reliance on generation-specific artifacts, such as stylistic priors and compression patterns. To address these limitations, we propose GAMMA, a novel training framework designed to reduce domain bias and enhance semantic alignment. GAMMA introduces diverse manipulation strategies, such as inpainting-based manipulation and semantics-preserving perturbations, to ensure consistency between manipulated and authentic content. We employ multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across diverse generative domains. In addition, a reverse cross-attention mechanism is introduced to allow the segmentation heads to guide and correct biased representations in the classification branch. Our method not only achieves state-of-the-art generalization performance on the GenImage benchmark, improving accuracy by 5.8%, but also maintains strong robustness on newly released generative models such as GPT-4o.
[120] Robustness and Diagnostic Performance of Super-Resolution Fetal Brain MRI
Ema Masterl, Tina Vipotnik Vesnaver, Žiga Špiclin
Main category: cs.CV
TL;DR: Comparison of three fetal brain MRI super-resolution reconstruction methods (NiftyMIC, SVRTK, NeSVoR) shows NeSVoR has highest success rate (>90%) and robustness across healthy and pathological cases, with diagnostic classification unaffected by SRR method choice despite volumetric differences.
Details
Motivation: Fetal brain MRI suffers from low resolution, motion artifacts, and inadequate 3D anatomy capture. Existing super-resolution reconstruction methods' comparative performance in pathological cases and their impact on downstream analysis remain underexplored.
Method: Applied three state-of-the-art SRR methods (NiftyMIC, SVRTK, NeSVoR) to 140 fetal brain MRI scans including healthy controls and pathological cases with ventriculomegaly. Each reconstruction was segmented using BoUNTi algorithm to extract volumes of nine brain structures.
Result: NeSVoR demonstrated highest reconstruction success rate (>90%) across both healthy and pathological groups. Significant differences in volumetric estimates were observed between SRR methods, but classification performance for ventriculomegaly was not affected by SRR method choice.
Conclusion: NeSVoR shows superior robustness and consistent performance. Diagnostic classification remains resilient despite SRR-induced volumetric variability, highlighting the method’s clinical applicability.
Abstract: Fetal brain MRI relies on rapid multi-view 2D slice acquisitions to reduce motion artifacts caused by fetal movement. However, these stacks are typically low resolution, may suffer from motion corruption, and do not adequately capture 3D anatomy. Super-resolution reconstruction (SRR) methods aim to address these limitations by combining slice-to-volume registration and super-resolution techniques to generate high-resolution (HR) 3D volumes. While several SRR methods have been proposed, their comparative performance - particularly in pathological cases - and their influence on downstream volumetric analysis and diagnostic tasks remain underexplored. In this study, we applied three state-of-the-art SRR methods - NiftyMIC, SVRTK, and NeSVoR - to 140 fetal brain MRI scans, including both healthy controls (HC) and pathological cases (PC) with ventriculomegaly (VM). Each HR reconstruction was segmented using the BoUNTi algorithm to extract volumes of nine principal brain structures. We evaluated visual quality, SRR success rates, volumetric measurement agreement, and diagnostic classification performance. NeSVoR demonstrated the highest and most consistent reconstruction success rate (>90%) across both HC and PC groups. Although significant differences in volumetric estimates were observed between SRR methods, classification performance for VM was not affected by the choice of SRR method. These findings highlight NeSVoR’s robustness and the resilience of diagnostic performance despite SRR-induced volumetric variability.
[121] Mask Consistency Regularization in Object Removal
Hua Yuan, Jin Yuan, Yicheng Jiang, Yao Zhang, Xin Geng, Yong Rui
Main category: cs.CV
TL;DR: Proposes Mask Consistency Regularization (MCR) to address mask hallucination and mask-shape bias in object removal image inpainting using diffusion models.
Details
Motivation: Current diffusion-based inpainting methods suffer from mask hallucination (generating irrelevant content) and mask-shape bias (filling mask area with shape-mimicking objects rather than contextual content).
Method: Introduces MCR training strategy with two mask perturbations: dilation (aligns output with surrounding content) and reshape (breaks mask-shape bias), enforcing consistency between perturbed and original mask outputs.
Result: MCR significantly reduces hallucinations and mask-shape bias, leading to improved object removal performance with more robust and contextually coherent inpainting results.
Conclusion: The proposed Mask Consistency Regularization effectively addresses key challenges in object removal tasks, producing better inpainting quality by maintaining contextual coherence and reducing artifacts.
Abstract: Object removal, a challenging task within image inpainting, involves seamlessly filling the removed region with content that matches the surrounding context. Despite advancements in diffusion models, current methods still face two critical challenges. The first is mask hallucination, where the model generates irrelevant or spurious content inside the masked region, and the second is mask-shape bias, where the model fills the masked area with an object that mimics the mask’s shape rather than surrounding content. To address these issues, we propose Mask Consistency Regularization (MCR), a novel training strategy designed specifically for object removal tasks. During training, our approach introduces two mask perturbations: dilation and reshape, enforcing consistency between the outputs of these perturbed branches and the original mask. The dilated masks help align the model’s output with the surrounding content, while reshaped masks encourage the model to break the mask-shape bias. This combination of strategies enables MCR to produce more robust and contextually coherent inpainting results. Our experiments demonstrate that MCR significantly reduces hallucinations and mask-shape bias, leading to improved performance in object removal.
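To illustrate the dilation branch, the sketch below perturbs a binary mask by max-pooling and penalizes disagreement between the two inpainting outputs; `inpaint_fn`, the kernel size, and the L1 consistency term are assumptions for illustration, and the reshape branch would be analogous.

```python
import torch.nn.functional as F

def dilate_mask(mask, kernel_size=15):
    """Binary mask dilation via max-pooling.
    mask: (B, 1, H, W) float tensor with values in {0, 1}."""
    pad = kernel_size // 2
    return F.max_pool2d(mask, kernel_size, stride=1, padding=pad)

def mask_consistency_loss(inpaint_fn, image, mask):
    """Outputs under the original and dilated masks should agree,
    pushing the fill toward the surrounding content rather than the
    mask's shape. `inpaint_fn` stands in for the inpainting model."""
    out_orig = inpaint_fn(image, mask)
    out_dil = inpaint_fn(image, dilate_mask(mask))
    return F.l1_loss(out_orig, out_dil)
```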
[122] MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation
Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Yanbing Zeng, Xiaoming Wei
Main category: cs.CV
TL;DR: MagicMirror is a comprehensive framework for assessing physical artifacts in text-to-image generation, featuring a detailed artifact taxonomy, large-scale human-annotated dataset, trained VLM assessor, and automated benchmark revealing significant artifacts in current models.
Details
Motivation: Current text-to-image models suffer from persistent physical artifacts (anatomical and structural flaws) that degrade perceptual quality, but lack systematic evaluation frameworks to address these issues.
Method: Developed detailed artifact taxonomy, created MagicData340K (340K human-annotated images), trained MagicAssessor VLM with novel data sampling and multi-level reward system using GRPO, and built MagicBench automated benchmark.
Result: Evaluation revealed that even top-tier T2I models like GPT-image-1 consistently suffer from significant artifacts, demonstrating the framework’s effectiveness in identifying model weaknesses.
Conclusion: Artifact reduction remains a critical challenge for T2I development, and MagicMirror provides a comprehensive solution for systematic artifact assessment and benchmarking.
Abstract: Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development. Project page: https://wj-inf.github.io/MagicMirror-page/.
[123] SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu
Main category: cs.CV
TL;DR: SignClip improves sign language translation by integrating both manual (hand gestures) and non-manual (lip movements) cues using hierarchical contrastive learning, achieving state-of-the-art results on benchmark datasets.
Details
Motivation: Most existing sign language translation approaches focus only on manual signals (hand gestures) and overlook non-manual cues like mouthing, which are crucial for disambiguating visually similar signs and conveying essential linguistic information.
Method: Proposes SignClip framework that fuses spatial gesture and lip movement features, and introduces hierarchical contrastive learning with multi-level alignment objectives to ensure semantic consistency across sign-lip and visual-text modalities.
Result: Extensive experiments on PHOENIX14T and How2Sign datasets show superiority. On PHOENIX14T in gloss-free setting, SignClip improves BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38, surpassing previous state-of-the-art model SpaMo.
Conclusion: The integration of both manual and non-manual cues through hierarchical contrastive learning significantly improves sign language translation accuracy, demonstrating the importance of considering comprehensive visual information beyond just hand gestures.
Abstract: Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.
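A standard symmetric InfoNCE between paired sign (hand) and lip embeddings illustrates the kind of sign-lip alignment described; SignClip's hierarchical, multi-level objective goes beyond this single-level sketch, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def infonce(sign_emb, lip_emb, temperature=0.07):
    """Symmetric contrastive alignment between paired modalities.
    sign_emb, lip_emb: (B, D); row i of each is a positive pair."""
    sign = F.normalize(sign_emb, dim=-1)
    lip = F.normalize(lip_emb, dim=-1)
    logits = sign @ lip.t() / temperature
    targets = torch.arange(sign.size(0), device=sign.device)
    # Average the sign-to-lip and lip-to-sign directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```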
[124] Detecting Text Manipulation in Images using Vision Language Models
Vidit Vidit, Pavel Korshunov, Amir Mohammadi, Christophe Ecabert, Ketan Kotwal, Sébastien Marcel
Main category: cs.CV
TL;DR: Analysis of VLMs for text manipulation detection, showing open-source models lag behind closed-source ones like GPT-4o, and revealing generalization issues in image manipulation-specific VLMs when applied to text detection.
Details
Motivation: While Large Vision Language Models have shown effectiveness in image manipulation detection, text manipulation detection remains largely unexplored, creating a knowledge gap that needs to be addressed.
Method: Benchmarked both closed- and open-source VLMs on various text manipulation datasets, including in-the-wild scene texts and fantasy ID cards that simulate real-world misuse scenarios.
Result: Open-source models are improving but still behind closed-source models like GPT-4o. Image manipulation detection-specific VLMs suffer from generalization problems when applied to text manipulation tasks.
Conclusion: There is a significant performance gap between open-source and closed-source VLMs for text manipulation detection, and specialized image manipulation models don’t generalize well to text detection tasks, highlighting the need for dedicated text manipulation detection research.
Abstract: Recent works have shown the effectiveness of Large Vision Language Models (VLMs or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing in these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on different text manipulation datasets. Our results suggest that open-source models are getting closer, but still lag behind closed-source ones like GPT-4o. Additionally, we benchmark image manipulation detection-specific VLMs for text manipulation detection and show that they suffer from the generalization problem. We benchmark VLMs for manipulations done on in-the-wild scene texts and on fantasy ID cards, where the latter mimics a challenging real-world misuse.
[125] MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection
Gang Li, Tianjiao Chen, Mingle Zhou, Min Li, Delong Han, Jin Wan
Main category: cs.CV
TL;DR: MCL-AD is a novel zero-shot 3D anomaly detection framework that leverages multimodal collaboration across point clouds, RGB images, and text semantics, achieving state-of-the-art performance without labeled training data.
Details
Motivation: Existing methods focus only on point clouds and neglect rich semantic cues from complementary modalities like RGB images and text priors, which limits performance in zero-shot 3D anomaly detection scenarios.
Method: Proposes Multimodal Prompt Learning Mechanism (MPLM) with object-agnostic decoupled text prompts and multimodal contrastive loss, plus a collaborative modulation mechanism (CMM) to jointly modulate RGB image-guided and point cloud-guided branches.
Result: Extensive experiments demonstrate that MCL-AD achieves state-of-the-art performance in zero-shot 3D anomaly detection.
Conclusion: The framework successfully leverages multimodal collaboration across point clouds, RGB images, and text semantics to achieve superior zero-shot 3D anomaly detection performance.
Abstract: Zero-shot 3D (ZS-3D) anomaly detection aims to identify defects in 3D objects without relying on labeled training data, making it especially valuable in scenarios constrained by data scarcity, privacy, or high annotation cost. However, most existing methods focus exclusively on point clouds, neglecting the rich semantic cues available from complementary modalities such as RGB images and text priors. This paper introduces MCL-AD, a novel framework that leverages multimodal collaboration learning across point clouds, RGB images, and text semantics to achieve superior zero-shot 3D anomaly detection. Specifically, we propose a Multimodal Prompt Learning Mechanism (MPLM) that enhances the intra-modal representation capability and inter-modal collaborative learning by introducing an object-agnostic decoupled text prompt and a multimodal contrastive loss. In addition, a collaborative modulation mechanism (CMM) is proposed to fully leverage the complementary representations of point clouds and RGB images by jointly modulating the RGB image-guided and point cloud-guided branches. Extensive experiments demonstrate that the proposed MCL-AD framework achieves state-of-the-art performance in ZS-3D anomaly detection.
[126] Adversarial robustness through Lipschitz-Guided Stochastic Depth in Neural Networks
Laith Nayal, Mahmoud Mousatat, Bader Rasheed
Main category: cs.CV
TL;DR: Lipschitz-guided stochastic depth method with depth-dependent drop probabilities improves robustness while maintaining accuracy and reducing computation in Vision Transformers.
Details
Motivation: Deep neural networks and Vision Transformers are vulnerable to adversarial attacks, and existing defenses are computationally expensive or lack formal guarantees.
Method: Proposed Lipschitz-guided stochastic depth (DropPath) method where drop probabilities increase with depth to control the effective Lipschitz constant, regularizing deeper layers.
Result: Experiments on CIFAR-10 with ViT-Tiny show near-baseline clean accuracy, enhanced robustness against FGSM, PGD-20, and AutoAttack attacks, and significant FLOPs reduction compared to baseline and linear DropPath schedules.
Conclusion: The depth-dependent drop probability schedule effectively improves adversarial robustness while preserving accuracy and reducing computational cost in Vision Transformers.
Abstract: Deep neural networks and Vision Transformers achieve state-of-the-art performance in computer vision but are highly vulnerable to adversarial perturbations. Standard defenses often incur high computational cost or lack formal guarantees. We propose a Lipschitz-guided stochastic depth (DropPath) method, where drop probabilities increase with depth to control the effective Lipschitz constant of the network. This approach regularizes deeper layers, improving robustness while preserving clean accuracy and reducing computation. Experiments on CIFAR-10 with ViT-Tiny show that our custom depth-dependent schedule maintains near-baseline clean accuracy, enhances robustness under FGSM, PGD-20, and AutoAttack, and significantly reduces FLOPs compared to baseline and linear DropPath schedules.
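A depth-dependent DropPath schedule can be as simple as the sketch below; the convex profile and the `p_max` value are assumptions, since the paper derives its schedule from a Lipschitz-constant analysis rather than a fixed formula.

```python
def droppath_schedule(depth, p_max=0.3, gamma=2.0):
    """Per-block DropPath probabilities that grow toward deeper layers.
    A convex profile (gamma > 1) regularizes deep layers harder than a
    linear ramp. Returns a list of length `depth`."""
    return [p_max * (i / max(depth - 1, 1)) ** gamma for i in range(depth)]

# Example: a 12-block ViT-Tiny gets probabilities rising from 0.0 to 0.3.
probs = droppath_schedule(12)
```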
[127] A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments
Evan Murphy, Marco Viola, Vladimir A. Krylov
Main category: cs.CV
TL;DR: Probabilistic framework using energy maps for precise geolocation of street furniture in urban environments, integrating GIS data with stochastic optimization.
Details
Motivation: Address the critical need for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders through accurate street furniture geolocation.
Method: Propose a probabilistic framework based on energy maps that encode spatial likelihood of object locations, using stochastic birth-and-death optimization algorithm to infer optimal asset configurations while integrating external geospatial information like GIS layers and road maps.
Result: Evaluated using realistic simulation informed by geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating potential for scalable and accurate urban asset mapping.
Conclusion: The framework shows promise for improving contextual awareness and localization accuracy in complex urban environments, with implementation made publicly available on GitHub.
Abstract: In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.
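A toy version of a stochastic birth-and-death loop over a discretized energy map, where low energy marks a likely object site; the acceptance rules here are a simplification and omit the interaction terms and GIS-based constraints the paper integrates.

```python
import numpy as np

def birth_death(energy, n_iters=10000, beta=1.0, seed=0):
    """Toy birth-and-death sampler over an (H, W) energy map.
    Returns a boolean occupancy grid of inferred object locations."""
    rng = np.random.default_rng(seed)
    H, W = energy.shape
    occupied = np.zeros((H, W), dtype=bool)
    for _ in range(n_iters):
        y, x = rng.integers(H), rng.integers(W)
        if not occupied[y, x]:
            # Birth: accept with probability decreasing in energy.
            if rng.random() < np.exp(-beta * energy[y, x]):
                occupied[y, x] = True
        else:
            # Death: implausible (high-energy) points die more easily.
            if rng.random() < 1.0 - np.exp(-beta * energy[y, x]):
                occupied[y, x] = False
    return occupied
```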
[128] I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation
Jordan Sassoon, Michal Szczepanski, Martyna Poreba
Main category: cs.CV
TL;DR: I-Segmenter is the first fully integer-only Vision Transformer for semantic segmentation that achieves efficient deployment while maintaining competitive accuracy through quantization optimization and novel activation functions.
Details
Motivation: Vision Transformers for semantic segmentation have high memory and computational costs that limit deployment on resource-constrained devices, and they are fragile under low-precision quantization.
Method: Systematically replaces floating-point operations with integer-only counterparts, introduces λ-ShiftGELU activation function to handle long-tailed distributions, removes L2 normalization, and replaces bilinear interpolation with nearest neighbor upsampling.
Result: Achieves accuracy within 5.1% of FP32 baseline on average, reduces model size by up to 3.8x, enables 1.2x faster inference, and works well even with one-shot PTQ using a single calibration image.
Conclusion: I-Segmenter provides a practical integer-only solution for efficient ViT-based semantic segmentation deployment with minimal accuracy loss and significant efficiency gains.
Abstract: Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose λ-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
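For context, the basic primitive any integer-only pipeline builds on is uniform quantization; the sketch below shows plain symmetric int8 rounding, not the paper's full integer-only graph or its λ-ShiftGELU activation.

```python
import torch

def quantize_int8(x, scale):
    """Uniform symmetric int8 quantization: round to integer steps of
    size `scale`, then clamp to the int8 range."""
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def dequantize(q, scale):
    """Map int8 codes back to approximate real values."""
    return q.to(torch.float32) * scale
```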
[129] Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
Main category: cs.CV
TL;DR: ClusCa accelerates diffusion transformers by performing spatial clustering to reduce token computation by over 90%, achieving 4.96x speedup on FLUX with improved quality.
Details
Motivation: Diffusion transformers suffer from high computational costs due to iterative denoising, and existing feature caching methods only leverage temporal similarity while ignoring spatial similarity.
Method: Performs spatial clustering on tokens in each timestep, computes only one token per cluster, and propagates information to all other tokens in the cluster.
Result: Achieves 4.96x acceleration on FLUX with 99.49% ImageReward (0.51% improvement over original), reduces tokens by over 90%, works for both text-to-image and text-to-video generation.
Conclusion: ClusCa provides an effective orthogonal approach to existing feature caching methods, enabling significant acceleration without training requirements while maintaining or improving quality.
Abstract: Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective for previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates their information to all the other tokens, which is able to reduce the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
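The core caching step can be sketched as: cluster the tokens, run the expensive computation on one representative per cluster, and broadcast the result back. In the sketch below, representative selection (first token per cluster) and copy-propagation are simplifications of the paper's scheme, and `compute_fn` is a placeholder for the transformer block.

```python
import torch

def cluster_cache_step(tokens, assign, compute_fn):
    """tokens: (N, D); assign: (N,) cluster ids in [0, K), every cluster
    non-empty; compute_fn maps (K, D) -> (K, D)."""
    K = int(assign.max()) + 1
    # One representative token per cluster.
    reps = torch.stack([tokens[assign == k][0] for k in range(K)])
    updated = compute_fn(reps)   # heavy computation on K << N tokens
    return updated[assign]       # propagate results back to all N tokens
```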
[130] GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography
Yuexi Du, Lihui Chen, Nicha C. Dvornek
Main category: cs.CV
TL;DR: GLAM is a foundation visual language model for mammography that uses geometry-guided global and local alignment to better capture multi-view relationships in breast cancer screening, outperforming existing methods.
Details
Motivation: Mammography interpretation speed and accuracy can be improved with deep learning, but current VLMs adapted from natural images ignore domain-specific multi-view relationships that radiologists use for accurate diagnosis.
Method: Proposes GLAM model with geometry-guided pretraining using joint global and local, visual-visual, and visual-language contrastive learning to capture cross-view alignments and fine-grained local features in mammograms.
Result: Outperforms baseline models across multiple datasets when pretrained on EMBED, one of the largest open mammography datasets.
Conclusion: The proposed geometry-guided approach successfully addresses the limitations of existing mammography VLMs by properly modeling multi-view correspondence learning, preserving critical geometric context for improved mammography interpretation.
Abstract: Mammography screening is an essential tool for early detection of breast cancer. The speed and accuracy of mammography interpretation have the potential to be improved with deep learning methods. However, the development of a foundation visual language model (VLM) is hindered by limited data and domain differences between natural and medical images. Existing mammography VLMs, adapted from natural images, often ignore domain-specific characteristics, such as multi-view relationships in mammography. Unlike radiologists who analyze both views together to process ipsilateral correspondence, current methods treat them as independent images or do not properly model the multi-view correspondence learning, losing critical geometric context and resulting in suboptimal prediction. We propose GLAM: Global and Local Alignment for Multi-view mammography for VLM pretraining using geometry guidance. By leveraging the prior knowledge about the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets under different settings.
[131] GARD: Gamma-based Anatomical Restoration and Denoising for Retinal OCT
Botond Fazekas, Thomas Pinetz, Guilherme Aresta, Taha Emre, Hrvoje Bogunovic
Main category: cs.CV
TL;DR: GARD is a novel deep learning approach for OCT image despeckling that uses gamma-based diffusion models and a noise-reduced fidelity term to better preserve anatomical details while reducing speckle noise.
Details
Motivation: OCT images suffer from speckle noise that obscures fine details and hinders accurate diagnosis. Existing denoising methods struggle to balance noise reduction with preservation of crucial anatomical structures.
Method: Uses Denoising Diffusion Gamma Model instead of Gaussian noise assumption, introduces Noise-Reduced Fidelity Term with pre-processed less-noisy image guidance, and adapts Denoising Diffusion Implicit Model framework for faster inference.
Result: Significantly outperforms traditional denoising methods and state-of-the-art deep learning models in PSNR, SSIM, and MSE metrics. Produces sharper edges and better preserves fine anatomical details.
Conclusion: GARD effectively addresses OCT speckle noise while maintaining structural integrity, offering superior performance over existing methods through its gamma-based diffusion approach and guided denoising process.
Abstract: Optical Coherence Tomography (OCT) is a vital imaging modality for diagnosing and monitoring retinal diseases. However, OCT images are inherently degraded by speckle noise, which obscures fine details and hinders accurate interpretation. While numerous denoising methods exist, many struggle to balance noise reduction with the preservation of crucial anatomical structures. This paper introduces GARD (Gamma-based Anatomical Restoration and Denoising), a novel deep learning approach for OCT image despeckling that leverages the strengths of diffusion probabilistic models. Unlike conventional diffusion models that assume Gaussian noise, GARD employs a Denoising Diffusion Gamma Model to more accurately reflect the statistical properties of speckle. Furthermore, we introduce a Noise-Reduced Fidelity Term that utilizes a pre-processed, less-noisy image to guide the denoising process. This crucial addition prevents the reintroduction of high-frequency noise. We accelerate the inference process by adapting the Denoising Diffusion Implicit Model framework to our Gamma-based model. Experiments on a dataset with paired noisy and less-noisy OCT B-scans demonstrate that GARD significantly outperforms traditional denoising methods and state-of-the-art deep learning models in terms of PSNR, SSIM, and MSE. Qualitative results confirm that GARD produces sharper edges and better preserves fine anatomical details.
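A hedged sketch of a gamma-noise forward step in the spirit of denoising diffusion gamma models: Gaussian noise is replaced by centered Gamma noise, whose skewness better matches speckle statistics. The shape/scale scheduling shown is purely illustrative, not the paper's exact parameterization.

```python
import torch

def gamma_forward_step(x0, t, alphas_bar, k0=1.0, theta0=0.001):
    """Add zero-mean Gamma noise to a clean image x0 at timestep t.
    alphas_bar: 1-D tensor of cumulative noise-schedule products."""
    k_t = k0 / alphas_bar[t]  # shape parameter grows with noise level
    g = torch.distributions.Gamma(k_t, 1.0 / theta0).sample(x0.shape)
    g_centered = g - k_t * theta0  # subtract the Gamma mean (k * theta)
    return alphas_bar[t].sqrt() * x0 + g_centered
```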
[132] Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos, Eda B. Özyiğit
Main category: cs.CV
TL;DR: This survey paper reviews visual grounding in vision language models (VLMs), covering its importance, core components, applications, benchmarks, and relationships with multimodal reasoning, while analyzing challenges and future directions.
Details
Motivation: Visual grounding enables models to identify image regions matching textual descriptions, which is crucial for applications like referring expression comprehension, visual question answering, and fine-grained control in various domains.
Method: The paper surveys representative works across key research areas, outlines grounding importance in VLMs, delineates core components of grounded model development, examines practical applications and benchmarks, and analyzes interrelations with multimodal reasoning.
Result: The survey provides a comprehensive review of visual grounding capabilities in modern VLMs, including evaluation metrics for grounded multimodal generation and the relationship between grounding, chain-of-thought, and reasoning.
Conclusion: The paper identifies challenges in visual grounding and suggests promising future research directions for improving grounding capabilities in vision language models across various applications.
Abstract: Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, captioning visual context by explicitly referring to entities, as well as low- and high-level control in simulated and real environments. In this survey paper, we review representative works across the key areas of research on modern general-purpose vision language models (VLMs). We first outline the importance of grounding in VLMs, then delineate the core components of the contemporary paradigm for developing grounded models, and examine their practical applications, including benchmarks and evaluation metrics for grounded multimodal generation. We also discuss the multifaceted interrelations among visual grounding, multimodal chain-of-thought, and reasoning in VLMs. Finally, we analyse the challenges inherent to visual grounding and suggest promising directions for future research.
[133] Immunizing Images from Text to Image Editing via Adversarial Cross-Attention
Matteo Trippodo, Federico Becattini, Lorenzo Seidenari
Main category: cs.CV
TL;DR: Attention Attack disrupts text-based image editing by using automated image captions to break cross-attention alignment between visual content and textual prompts, without needing knowledge of the editing method or prompt.
Details
Motivation: Text-based image editing methods are vulnerable to adversarial attacks, particularly targeting the visual component. Existing methods lack robustness against attacks that disrupt the alignment between image content and textual descriptions.
Method: Proposes Attention Attack that uses automatically generated captions of source images as proxy edit prompts to disrupt cross-attention mechanisms. Introduces two novel evaluation metrics: Caption Similarity for semantic consistency and semantic IoU for spatial layout disruption.
Result: Experiments on TEDBench++ show the attack significantly degrades editing performance while remaining imperceptible, demonstrating effectiveness against text-based image editing systems.
Conclusion: The proposed Attention Attack successfully targets the visual component of text-based image editing methods, highlighting vulnerabilities in current systems and providing new evaluation metrics for assessing attack effectiveness.
Abstract: Recent advances in text-based image editing have enabled fine-grained manipulation of visual content guided by natural language. However, such methods are susceptible to adversarial attacks. In this work, we propose a novel attack that targets the visual component of editing methods. We introduce Attention Attack, which disrupts the cross-attention between a textual prompt and the visual representation of the image by using an automatically generated caption of the source image as a proxy for the edit prompt. This breaks the alignment between the contents of the image and their textual description, without requiring knowledge of the editing method or the editing prompt. Reflecting on the reliability of existing metrics for immunization success, we propose two novel evaluation strategies: Caption Similarity, which quantifies semantic consistency between original and adversarial edits, and semantic Intersection over Union (IoU), which measures spatial layout disruption via segmentation masks. Experiments conducted on the TEDBench++ benchmark demonstrate that our attack significantly degrades editing performance while remaining imperceptible.
[134] Efficient Learned Image Compression Through Knowledge Distillation
Fabien Allemand, Attilio Fiandrotti, Sumanta Chaudhuri, Alaa Eddine Mazouz
Main category: cs.CV
TL;DR: Knowledge distillation applied to neural image compression reduces computational requirements while maintaining performance across different architectures and quality/bitrate tradeoffs.
Details
Motivation: Neural network-based image compression methods outperform conventional codecs but require significant processing power, making them unsuitable for real-time use on resource-constrained platforms.
Method: Leveraging knowledge distillation, where smaller neural networks are trained on outputs from larger, more complex teacher models to achieve better performance than independent training.
Result: Knowledge distillation effectively reduces resource requirements for image compression across various architecture sizes while achieving different image quality/bit rate tradeoffs and saving processing/energy resources.
Conclusion: Knowledge distillation is a viable approach for making neural image compression more practical for real-world applications, with potential for extension to transformer-based models and exploration of different teacher models and loss functions.
Abstract: Learned image compression sits at the intersection of machine learning and image processing. With advances in deep learning, neural network-based compression methods have emerged. In this process, an encoder maps the image to a low-dimensional latent space, which is then quantized, entropy-coded into a binary bitstream, and transmitted to the receiver. At the receiver end, the bitstream is entropy-decoded, and a decoder reconstructs an approximation of the original image. Recent research suggests that these models consistently outperform conventional codecs. However, they require significant processing power, making them unsuitable for real-time use on resource-constrained platforms, which hinders their deployment in mainstream applications. This study aims to reduce the resource requirements of neural networks used for image compression by leveraging knowledge distillation, a training paradigm where smaller neural networks, partially trained on the outputs of larger, more complex models, can achieve better performance than when trained independently. Our work demonstrates that knowledge distillation can be effectively applied to image compression tasks: i) across various architecture sizes, ii) to achieve different image quality/bit rate tradeoffs, and iii) to save processing and energy resources. This approach introduces new settings and hyperparameters, and future research could explore the impact of different teacher models, as well as alternative loss functions. Knowledge distillation could also be extended to transformer-based models. The code is publicly available at: https://github.com/FABallemand/PRIM .
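To make the distillation recipe concrete, here is a minimal PyTorch sketch of output-level knowledge distillation for a learned compression model. The `student`, `teacher`, and single `alpha` weighting are illustrative assumptions, not the paper's exact formulation, which may also distill latents and include a rate term.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, alpha=0.5):
    """Blend fidelity to the original image with fidelity to the frozen
    teacher's reconstruction (hypothetical weighting)."""
    with torch.no_grad():
        x_teacher = teacher(x)                    # teacher reconstruction
    x_student = student(x)                        # student reconstruction
    loss_gt = F.mse_loss(x_student, x)            # match the ground truth
    loss_kd = F.mse_loss(x_student, x_teacher)    # match the teacher
    return alpha * loss_kd + (1.0 - alpha) * loss_gt
```

In practice, `alpha` trades off how strongly the student copies the teacher versus the data; the paper explores several architecture sizes and quality/bitrate tradeoffs around this basic idea.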
[135] Multimodal SAM-adapter for Semantic Segmentation
Iacopo Curti, Pierluigi Zama Ramirez, Alioscia Petrelli, Luigi Di Stefano
Main category: cs.CV
TL;DR: MM SAM-adapter extends Segment Anything Model for multimodal semantic segmentation using adapter networks to fuse auxiliary sensor data with RGB features, achieving state-of-the-art performance on challenging benchmarks.
Details
Motivation: Current semantic segmentation methods are vulnerable to challenging conditions like poor lighting, occlusions, and adverse weather. Multimodal approaches integrating auxiliary sensor data can provide complementary information to enhance robustness.
Method: Proposes an adapter network that injects fused multimodal features (e.g., LiDAR, infrared) into SAM’s RGB features, enabling selective incorporation of auxiliary modalities only when they provide additional cues while retaining RGB generalization.
Result: Achieves state-of-the-art performance on three benchmarks (DeLiVER, FMB, MUSES). Outperforms competing methods in both RGB-easy and RGB-hard subsets, demonstrating effectiveness in both favorable and adverse conditions.
Conclusion: MM SAM-adapter provides a balanced and efficient framework for multimodal semantic segmentation, effectively enhancing robustness through selective multimodal adaptation while maintaining strong RGB generalization capabilities.
Abstract: Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM’s rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal-SAM-Adapter.
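The abstract describes an adapter that injects fused multimodal features into SAM's RGB features. Below is a minimal sketch of such a gated residual injection; the module structure and names are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Hypothetical adapter: inject auxiliary-modality features into frozen
    RGB backbone features through a learned gate."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb_feat, aux_feat):
        fused = self.fuse(torch.cat([rgb_feat, aux_feat], dim=1))
        g = self.gate(fused)            # 0..1 per channel and location
        return rgb_feat + g * fused     # residual injection, RGB path preserved
```

The gate lets the model fall back to near-pure RGB features when the auxiliary modality adds no useful cue, which matches the selective-incorporation behavior described above.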
[136] Ordinality of Visible-Thermal Image Intensities for Intrinsic Image Decomposition
Zeqing Leo Yuan, Mani Ramanagopal, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan
Main category: cs.CV
TL;DR: Training-free intrinsic image decomposition using visible-thermal image pairs, leveraging thermal absorption to derive ordinal constraints for self-supervised shading and reflectance recovery.
Details
Motivation: Lack of extensive ground-truth data for real-world intrinsic image decomposition, with existing methods relying on synthetic data or sparse annotations for limited scenes.
Method: Uses visible and thermal image pairs, leveraging that absorbed light becomes heat detected by thermal cameras. Relates ordinalities between visible/thermal intensities to shading/reflectance ordinalities to densely self-supervise neural network optimization.
Result: Superior performance over recent learning-based models in quantitative evaluations with known reflectance/shading under natural/artificial lighting, and qualitative experiments across diverse outdoor scenes.
Conclusion: Provides a scalable path to curating real-world ordinal supervision previously infeasible via manual labeling, demonstrating effective training-free intrinsic decomposition.
Abstract: Decomposing an image into its intrinsic photometric factors–shading and reflectance–is a long-standing challenge due to the lack of extensive ground-truth data for real-world scenes. Recent methods rely on synthetic data or sparse annotations for limited indoor and even fewer outdoor scenes. We introduce a novel training-free approach for intrinsic image decomposition using only a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities between visible and thermal image intensities to the ordinalities of shading and reflectance, which can densely self-supervise an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse outdoor scenes. The results demonstrate superior performance over recent learning-based models and point toward a scalable path to curating real-world ordinal supervision, previously infeasible via manual labeling.
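The self-supervision here is ordinal: orderings of visible/thermal intensities constrain orderings of the predicted shading and reflectance. A hedged sketch of the kind of pairwise ranking penalty this implies is below; how the index pairs are mined from the two images is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def ordinal_loss(pred, idx_hi, idx_lo, margin=0.0):
    """Pairwise ranking penalty on a flattened shading (or reflectance) map:
    pred[idx_hi] should exceed pred[idx_lo], where the (hi, lo) index pairs
    are assumed to come from visible/thermal intensity orderings."""
    target = torch.ones_like(pred[idx_hi])
    return F.margin_ranking_loss(pred[idx_hi], pred[idx_lo], target, margin=margin)
```

Because such constraints can be collected densely across pixel pairs, they can supervise an optimizing network without any ground-truth decomposition.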
[137] Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards
Xiem HoangVan, Dang BuiDinh, Sang NguyenQuang, Wen-Hsiao Peng
Main category: cs.CV
TL;DR: This paper presents a comprehensive survey on compressed video quality enhancement (CVQE) methods, addressing limitations in existing surveys by providing a systematic taxonomy, unified benchmarking framework, and analysis of performance-complexity trade-offs.
Details
Motivation: Existing CVQE surveys suffer from a lack of systematic classification linking methods to specific standards and artifacts, insufficient comparative analysis across architectural paradigms, and underdeveloped benchmarking practices, creating gaps in the field.
Method: The paper introduces three key contributions: 1) a novel taxonomy classifying CVQE methods across architectural paradigms, coding standards, and compressed-domain feature utilization; 2) a unified benchmarking framework with modern compression protocols and standard test sequences; 3) a systematic analysis of performance-complexity trade-offs.
Result: The comprehensive review establishes a foundation for consistent assessment and informed model selection in CVQE research and deployment, highlighting promising directions for future research.
Conclusion: This survey addresses critical gaps in CVQE literature by providing systematic classification, fair benchmarking practices, and analysis of trade-offs, which will benefit researchers and practitioners in selecting and developing effective video quality enhancement methods.
Abstract: Compressed video quality enhancement (CVQE) is crucial for improving user experience with lossy video codecs like H.264/AVC, H.265/HEVC, and H.266/VVC. While deep learning based CVQE has driven significant progress, existing surveys still suffer from limitations: lack of systematic classification linking methods to specific standards and artifacts, insufficient comparative analysis of architectural paradigms across coding types, and underdeveloped benchmarking practices. To address these gaps, this paper presents three key contributions. First, it introduces a novel taxonomy classifying CVQE methods across architectural paradigms, coding standards, and compressed-domain feature utilization. Second, it proposes a unified benchmarking framework integrating modern compression protocols and standard test sequences for fair multi-criteria evaluation. Third, it provides a systematic analysis of the critical trade-offs between reconstruction performance and computational complexity observed in state-of-the-art methods and highlights promising directions for future research. This comprehensive review aims to establish a foundation for consistent assessment and informed model selection in CVQE research and deployment.
[138] InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Tao Han, Wanghan Xu, Junchao Gong, Xiaoyu Yue, Song Guo, Luping Zhou, Lei Bai
Main category: cs.CV
TL;DR: InfGen is a new method that replaces VAE decoders with a one-step generator to enable arbitrary resolution image generation from fixed-size latents, reducing 4K generation time from over 100 seconds to under 10 seconds.
Details
Motivation: Current diffusion models have quadratic computational complexity with resolution, making high-resolution image generation (like 4K) slow and resource-intensive, taking over 100 seconds.
Method: Uses a fixed latent from diffusion models as content representation and replaces VAE decoder with a compact one-step generator that can decode arbitrary resolution images from fixed-size latents without retraining diffusion models.
Result: InfGen reduces 4K image generation time to under 10 seconds while maintaining quality, and can be applied to any model using the same latent space.
Conclusion: InfGen enables efficient arbitrary resolution image generation, bringing many models into the high-resolution era with significantly reduced computational complexity and generation time.
Abstract: Arbitrary resolution image generation provides a consistent visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays of over 100 seconds. To solve this, we explore a second generation built upon latent diffusion models, where the fixed latent generated by diffusion models is regarded as the content representation, and we propose to decode arbitrary resolution images from this compact latent using a one-step generator. Thus, we present InfGen, which replaces the VAE decoder with the new generator to generate images at any resolution from a fixed-size latent without retraining the diffusion models; this simplifies the process, reduces computational complexity, and can be applied to any model using the same latent space. Experiments show InfGen is capable of bringing many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.
[139] SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer’s Prediction Tasks and Datasets
Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel
Main category: cs.CV
TL;DR: Self-supervised learning model for Alzheimer’s prediction using 3D brain MRI with temporal order prediction and contrastive learning, outperforming supervised methods on most tasks while handling variable-length inputs and time intervals.
Details
Motivation: Address limitations of current deep learning models for Alzheimer’s prediction including lack of labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals.
Method: Adapted three state-of-the-art temporal SSL approaches for 3D brain MRI analysis with novel extensions for variable-length inputs and robust spatial features. Pre-trained on aggregated dataset of 3,161 patients from four public datasets.
Result: SSL model with temporal order prediction and contrastive learning outperformed supervised learning on 6 out of 7 downstream tasks including diagnosis classification, conversion detection, and future conversion prediction.
Conclusion: The SSL approach demonstrates superior adaptability and generalizability across tasks and varying input conditions, showing strong potential for robust clinical applications in Alzheimer’s disease prediction.
Abstract: Alzheimer’s disease is a progressive, neurodegenerative disorder that causes memory loss and cognitive decline. While there has been extensive research in applying deep learning models to Alzheimer’s prediction tasks, these models remain limited by lack of available labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. In this study, we adapt three state-of-the-art temporal self-supervised learning (SSL) approaches for 3D brain MRI analysis, and add novel extensions designed to handle variable-length inputs and learn robust spatial features. We aggregate four publicly available datasets comprising 3,161 patients for pre-training, and show the performance of our model across multiple Alzheimer’s prediction tasks including diagnosis classification, conversion detection, and future conversion prediction. Importantly, our SSL model implemented with temporal order prediction and contrastive learning outperforms supervised learning on six out of seven downstream tasks. It demonstrates adaptability and generalizability across tasks and number of input images with varying time intervals, highlighting its capacity for robust performance across clinical applications. We release our code and model publicly at https://github.com/emilykaczmarek/SSL-AD.
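One of the adapted SSL objectives is temporal order prediction. Below is a minimal sketch of such a pretext head; the GRU-based design and binary ordered/shuffled target are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class OrderPredictionHead(nn.Module):
    """Hypothetical SSL head: classify whether a sequence of per-scan
    embeddings is in its true chronological order."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # handles variable-length input
        self.cls = nn.Linear(dim, 2)                   # ordered vs. shuffled

    def forward(self, scan_embeddings):                # (B, T, dim)
        _, h = self.rnn(scan_embeddings)
        return self.cls(h[-1])                         # (B, 2) logits
```

During pre-training, one would randomly permute the scan order for part of each batch and train with cross-entropy, forcing the encoder to capture disease progression across scans taken at varying intervals.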
[140] The Weighting Game: Evaluating Quality of Explainability Methods
Lassi Raatikainen, Esa Rahtu
Main category: cs.CV
TL;DR: This paper introduces two new metrics (Weighting Game and stability metric) to evaluate explanation heatmap quality for image classification, focusing on accuracy and stability of CAM methods across different model architectures.
Details
Motivation: To assess the quality of explanation heatmaps for image classification tasks through the lens of accuracy and stability, as current evaluation methods may be insufficient.
Method: Introduced the Weighting Game metric to measure how much class-guided explanation falls within correct class segmentation masks, and a stability metric using zooming/panning transformations to compare saliency maps with similar contents.
Result: Quantitative experiments evaluated commonly used CAM methods using these new metrics, revealing that explanation quality varies significantly between different model architectures.
Conclusion: Model architecture should be considered when choosing explainability methods, as different architectures produce explanations of varying quality according to the proposed accuracy and stability metrics.
Abstract: The objective of this paper is to assess the quality of explanation heatmaps for image classification tasks. To assess the quality of explainability methods, we approach the task through the lens of accuracy and stability. In this work, we make the following contributions. Firstly, we introduce the Weighting Game, which measures how much of a class-guided explanation is contained within the correct class’ segmentation mask. Secondly, we introduce a metric for explanation stability, using zooming/panning transformations to measure differences between saliency maps with similar contents. Quantitative experiments are produced, using these new metrics, to evaluate the quality of explanations provided by commonly used CAM methods. The quality of explanations is also contrasted between different model architectures, with findings highlighting the need to consider model architecture when choosing an explainability method.
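As described, the Weighting Game measures how much of a class-guided explanation's mass falls inside the correct class's segmentation mask, which reduces to a simple ratio; the clipping and normalization details below are assumptions.

```python
import numpy as np

def weighting_game(saliency, mask, eps=1e-8):
    """Fraction of non-negative saliency mass inside the correct class's
    binary segmentation mask (1.0 = explanation fully contained)."""
    s = np.clip(saliency, 0.0, None)        # ignore negative attributions
    return float((s * mask).sum() / (s.sum() + eps))
```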
[141] GeoDE: a Geographically Diverse Evaluation Dataset for Object Recognition
Vikram V. Ramaswamy, Sing Yu Lin, Dora Zhao, Aaron B. Adcock, Laurens van der Maaten, Deepti Ghadiyaram, Olga Russakovsky
Main category: cs.CV
TL;DR: GeoDE is a geographically diverse image dataset collected by soliciting images from people worldwide, addressing biases in web-scraped datasets. It contains 61,940 images from 40 classes across 6 world regions with no personally identifiable information.
Details
Motivation: Current web-scraped datasets reinforce stereotypical biases, contain personal information, and are predominantly from Europe/North America. The authors aim to create a more geographically diverse and ethically collected dataset.
Method: Collected 61,940 images by soliciting contributions from people around the world across 40 classes and 6 world regions, avoiding web scraping. Analyzed differences compared to web-scraped data and demonstrated its use for evaluation and training.
Result: Created GeoDE dataset that is geographically diverse, contains no personally identifiable information, and helps highlight and mitigate shortcomings in current computer vision models despite its relatively small size.
Conclusion: GeoDE provides an alternative dataset collection paradigm that addresses geographical bias and privacy concerns in current datasets, serving as both evaluation and training data to improve model fairness and diversity.
Abstract: Current dataset collection methods typically scrape large amounts of data from the web. While this technique is extremely scalable, data collected in this way tends to reinforce stereotypical biases, can contain personally identifiable information, and typically originates from Europe and North America. In this work, we rethink the dataset collection paradigm and introduce GeoDE, a geographically diverse dataset with 61,940 images from 40 classes and 6 world regions, with no personally identifiable information, collected by soliciting images from people around the world. We analyse GeoDE to understand differences in images collected in this manner compared to web-scraping. We demonstrate its use as both an evaluation and training dataset, allowing us to highlight and begin to mitigate the shortcomings in current models, despite GeoDE’s relatively small size. We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu
[142] TSGCNeXt: Dynamic-Static Multi-Graph Convolution for Efficient Skeleton-Based Action Recognition with Long-term Learning Potential
Dongjingdin Liu, Pengpeng Chen, Miao Yao, Yijing Lu, Zijie Cai, Yuxin Tian
Main category: cs.CV
TL;DR: TSGCNeXt proposes an efficient graph convolutional network for skeleton-based action recognition, addressing redundant training and long temporal sequence bottlenecks with dynamic-static graph convolution and acceleration mechanisms.
Details
Motivation: Current GCN approaches for skeleton action recognition tend to be complex with redundant training and struggle with long time-series data, requiring more efficient learning mechanisms.
Method: Proposes Dynamic-Static Separate Multi-graph Convolution (DS-SMG) to aggregate multiple topological graphs, a graph convolution acceleration mechanism for 55.08% speed-up, and three spatio-temporal learning modules for long temporal modeling.
Result: Outperforms previous single-stream networks on NTU RGB+D 60 and 120 datasets. With EMA model in multi-stream fusion, achieves SOTA with 90.22% (cross-subject) and 91.74% (cross-set) accuracy on NTU 120.
Conclusion: TSGCNeXt provides an efficient solution for long temporal skeleton sequence learning, achieving state-of-the-art performance while optimizing training speed and computational efficiency.
Abstract: Skeleton-based action recognition has achieved remarkable results in human action recognition with the development of graph convolutional networks (GCNs). However, recent works tend to construct complex learning mechanisms with redundant training and face a bottleneck for long time series. To solve these problems, we propose the Temporal-Spatio Graph ConvNeXt (TSGCNeXt) to explore an efficient learning mechanism for long temporal skeleton sequences. Firstly, a new graph learning mechanism with a simple structure, Dynamic-Static Separate Multi-graph Convolution (DS-SMG), is proposed to aggregate features of multiple independent topological graphs and avoid node information being ignored during dynamic convolution. Next, we construct a graph convolution training acceleration mechanism to optimize the back-propagation computation of dynamic graph learning, achieving a 55.08% speed-up. Finally, TSGCNeXt restructures the overall GCN architecture with three spatio-temporal learning modules, efficiently modeling long temporal features. In comparison with previous methods on the large-scale datasets NTU RGB+D 60 and 120, TSGCNeXt outperforms among single-stream networks. In addition, with the EMA model introduced into the multi-stream fusion, TSGCNeXt achieves SOTA levels. On the cross-subject and cross-set splits of NTU 120, accuracies reach 90.22% and 91.74%, respectively.
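A minimal sketch of the multi-graph aggregation idea behind DS-SMG: joint features are propagated over several independent skeleton topologies (static and dynamic) and summed. The per-graph linear projections are an assumption about the parameterization.

```python
import torch
import torch.nn as nn

class MultiGraphConv(nn.Module):
    """Hypothetical multi-graph convolution over K independent topologies."""
    def __init__(self, in_dim, out_dim, adjacencies):
        super().__init__()
        self.register_buffer("A", torch.stack(adjacencies))  # (K, V, V)
        self.linears = nn.ModuleList(
            nn.Linear(in_dim, out_dim) for _ in range(self.A.shape[0]))

    def forward(self, x):                     # x: (B, V, in_dim), V = num joints
        out = 0
        for k, lin in enumerate(self.linears):
            out = out + self.A[k] @ lin(x)    # per-graph aggregation, then sum
        return out
```

Keeping each topology's parameters separate is what allows the convolution to preserve node information that a single dynamically learned graph might wash out.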
[143] Evaluating the Evaluators: Towards Human-aligned Metrics for Missing Markers Reconstruction
Taras Kucherenko, Derek Peristy, Judith Bütepage
Main category: cs.CV
TL;DR: Paper shows MSE metric doesn’t correlate with subjective perception of animation marker reconstruction quality, proposes better-correlated metrics for the field.
Details
Motivation: Optical motion capture systems often have missing markers due to errors/occlusions, requiring time-consuming manual cleaning. Machine learning solutions exist but use inadequate metrics.
Method: Introduces and evaluates a set of new metrics that better correlate with subjective perception of fill quality compared to traditional mean square error.
Result: Demonstrates that mean square error does not align with human subjective assessment of reconstruction quality, while the proposed metrics show better correlation.
Conclusion: Proposes improved metrics for missing marker reconstruction that can drive better progress in the field by aligning with human perception rather than relying on simplistic MSE.
Abstract: Animation data is often obtained through optical motion capture systems, which utilize a multitude of cameras to establish the position of optical markers. However, system errors or occlusions can result in missing markers, the manual cleaning of which can be time-consuming. This has sparked interest in machine learning-based solutions for missing marker reconstruction in the academic community. Most academic papers utilize a simplistic mean square error as the main metric. In this paper, we show that this metric does not correlate with subjective perception of the fill quality. Additionally, we introduce and evaluate a set of better-correlated metrics that can drive progress in the field.
[144] Your Image is Secretly the Last Frame of a Pseudo Video
Wenlong Chen, Wenlin Chen, Lapo Rastrelli, Yingzhen Li
Main category: cs.CV
TL;DR: The paper hypothesizes that diffusion models’ success comes from self-supervision via corrupted images forming pseudo videos, and shows this approach can improve other generative models.
Details
Motivation: Diffusion models outperform standard HVAEs in image quality, possibly due to additional self-supervision from corrupted images forming pseudo videos. The authors want to apply this concept to improve other generative models.
Method: Extend image generative models to video generative models, train them on pseudo videos created by applying data augmentation to original images. Analyze issues with first-order Markov augmentation and propose more expressive augmentation methods.
Result: Empirical results on CIFAR10 and CelebA datasets show improved image generation quality with additional self-supervised information from pseudo videos.
Conclusion: The self-supervision provided by pseudo videos (corrupted images) can enhance image generation quality in various generative models, not just diffusion models, and more expressive data augmentation yields better results.
Abstract: Diffusion models, which can be viewed as a special case of hierarchical variational autoencoders (HVAEs), have shown profound success in generating photo-realistic images. In contrast, standard HVAEs often produce images of inferior quality compared to diffusion models. In this paper, we hypothesize that the success of diffusion models can be partly attributed to the additional self-supervision information for their intermediate latent states provided by corrupted images, which along with the original image form a pseudo video. Based on this hypothesis, we explore the possibility of improving other types of generative models with such pseudo videos. Specifically, we first extend a given image generative model to its video generative model counterpart, and then train the video generative model on pseudo videos constructed by applying data augmentation to the original images. Furthermore, we analyze the potential issues of first-order Markov data augmentation methods, which are typically used in diffusion models, and propose to use more expressive data augmentation to construct more useful information in pseudo videos. Our empirical results on the CIFAR10 and CelebA datasets demonstrate that improved image generation quality can be achieved with additional self-supervised information from pseudo videos.
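A minimal sketch of the pseudo-video construction under the simplest, diffusion-like augmentation choice: additive Gaussian noise at decreasing strength, so the clean image is the last frame. The paper argues for more expressive augmentations than this first-order Markov example.

```python
import torch

def make_pseudo_video(image, num_frames=8, max_sigma=1.0):
    """Corrupt an image at decreasing noise levels so the clean image is
    the last frame of the resulting pseudo video."""
    frames = []
    for t in range(num_frames):
        sigma = max_sigma * (1 - t / (num_frames - 1))  # max_sigma -> 0
        frames.append(image + sigma * torch.randn_like(image))
    return torch.stack(frames)   # (T, C, H, W); final frame equals `image`
```

A video generative model trained on such sequences receives supervision at every intermediate corruption level, which is the extra signal the paper credits for diffusion models' quality advantage.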
[145] AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models
Yaopei Zeng, Yuanpu Cao, Bochuan Cao, Yurui Chang, Jinghui Chen, Lu Lin
Main category: cs.CV
TL;DR: AdvI2I framework uses adversarial image attacks to bypass safety filters in Image-to-Image diffusion models and generate NSFW content without modifying text prompts.
Details
Motivation: Existing text-based adversarial prompts for generating NSFW content are easily detectable by filters, but image-based attacks on I2I diffusion models remain unexplored and could bypass current defenses.
Method: Proposes AdvI2I framework that optimizes a generator to craft adversarial input images that induce diffusion models to generate NSFW content. Also introduces AdvI2I-Adaptive version that adapts to countermeasures and minimizes resemblance to NSFW embeddings.
Result: Both AdvI2I and AdvI2I-Adaptive effectively bypass current safety mechanisms like Safe Latent Diffusion (SLD) without altering text prompts.
Conclusion: Reveals a critical vulnerability in I2I diffusion models, demonstrating that image-based attacks can circumvent existing safeguards, highlighting the urgent need for stronger security measures.
Abstract: Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.
[146] LoFi: Vision-Aided Label Generator for Wi-Fi Localization and Tracking
Zijian Zhao, Tingwei Chen, Fanyi Meng, Zhijie Cai, Hang Li, Xiaoyang Li, Guangxu Zhu
Main category: cs.CV
TL;DR: LoFi is a vision-aided system that generates precise ground truth position coordinates from 2D images for Wi-Fi localization and tracking, addressing the data labeling challenges in data-driven approaches.
Details
Motivation: Existing data collection methods for Wi-Fi localization provide only coarse-grained ground truth or limited labeled points, while precise systems like lidar are too expensive for widespread use.
Method: LoFi uses 2D images from a webcam to generate high-precision ground truth position coordinates, working with ESP32-S3 Wi-Fi devices to create localization datasets.
Result: The system successfully compiled a Wi-Fi tracking and localization dataset with high precision, low cost, and ease of use, making precise ground truth generation accessible.
Conclusion: LoFi provides an effective solution for generating precise ground truth data for Wi-Fi localization systems using affordable vision-based technology, enabling better advancement of data-driven approaches.
Abstract: Data-driven Wi-Fi localization and tracking have shown great promise due to their lower reliance on specialized hardware compared to model-based methods. However, most existing data collection techniques provide only coarse-grained ground truth or a limited number of labeled points, significantly hindering the advancement of data-driven approaches. While systems like lidar can deliver precise ground truth, their high costs make them inaccessible to many users. To address these challenges, we propose LoFi, a vision-aided label generator for Wi-Fi localization and tracking. LoFi can generate ground truth position coordinates solely from 2D images, offering high precision, low cost, and ease of use. Utilizing our method, we have compiled a Wi-Fi tracking and localization dataset using the ESP32-S3 and a webcam. The code and dataset of this paper are available at https://github.com/RS2002/LoFi.
[147] MoPD: Mixture-of-Prompts Distillation for Vision-Language Models
Yang Chen, Shuai Fu, Yu Zhang
Main category: cs.CV
TL;DR: MoPD uses prompt distillation from hard teacher prompts to soft student prompts with a gating network to improve generalization on unseen classes in vision-language models.
Details
Motivation: Existing soft prompt learning methods overfit seen classes and perform poorly on unseen classes due to training data bias towards seen classes.
Method: Mixture-of-Prompts Distillation (MoPD) transfers knowledge from manually crafted hard prompts to learnable soft prompts using a gating network that selects appropriate hard prompts for distillation.
Result: Extensive experiments show MoPD outperforms state-of-the-art baselines, particularly on unseen classes.
Conclusion: MoPD effectively enhances soft prompt generalization by leveraging knowledge distillation from hard prompts, addressing the overfitting issue in vision-language models.
Abstract: Soft prompt learning methods are effective for adapting vision-language models (VLMs) to downstream tasks. Nevertheless, empirical evidence reveals a tendency of existing methods to overfit seen classes and exhibit degraded performance on unseen classes. This limitation is due to the inherent bias in the training data towards the seen classes. To address this issue, we propose a novel soft prompt learning method, named Mixture-of-Prompts Distillation (MoPD), which can effectively transfer useful knowledge from manually hand-crafted hard prompts (a.k.a. teacher prompts) to the learnable soft prompt (a.k.a. student prompt), thereby enhancing the generalization ability of soft prompts on unseen classes. Moreover, the proposed MoPD method utilizes a gating network that learns to select hard prompts used for prompt distillation. Extensive experiments demonstrate that the proposed MoPD method outperforms state-of-the-art baselines, especially on unseen classes.
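A hedged sketch of the gated distillation objective this suggests: a gating distribution weights the KL divergence between the soft-prompt student's logits and each hard-prompt teacher's logits. The temperature and the soft weighting (rather than hard selection) are assumptions.

```python
import torch
import torch.nn.functional as F

def gated_distillation_loss(student_logits, teacher_logits_list, gate_logits, T=2.0):
    """Weight per-teacher KL terms by a learned gating distribution
    (hypothetical form of the MoPD objective)."""
    gate = F.softmax(gate_logits, dim=-1)                    # (num_teachers,)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    loss = 0.0
    for w, t_logits in zip(gate, teacher_logits_list):
        p_teacher = F.softmax(t_logits / T, dim=-1)
        loss = loss + w * F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return (T * T) * loss
```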
[148] Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge
Nikolaos Dionelis, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold, Nicolas Longépé
Main category: cs.CV
TL;DR: MapYourCity is a multi-modal benchmark dataset for building age estimation using satellite and street-view imagery, with a 7-class classification task covering 1900-present periods. Top models from a 2024 challenge show effective generalization to unseen cities and robustness to missing street-view data.
Details
Motivation: Accurate building age estimation is critical for sustainable urban planning, as older buildings often lack energy-efficient features. This helps reduce energy consumption and mitigate climate change.
Method: Created MapYourCity dataset with VHR satellite imagery, Sentinel-2 multi-spectral data, and street-view images across European cities. Organized community challenge with 7-class classification (1900-present). Evaluated top models on generalization to unseen cities and missing modality scenarios.
Result: Building age estimation is feasible and effective even in unseen cities and when relying solely on top-view satellite imagery (VHR + Sentinel-2). Models demonstrated strong performance without street-view data.
Conclusion: MapYourCity provides a valuable resource for developing scalable solutions in sustainable urban analytics, showing that satellite imagery alone can effectively estimate building construction years for energy efficiency planning.
Abstract: Estimating the construction year of buildings is critical for advancing sustainability, as older structures often lack energy-efficient features. Sustainable urban planning relies on accurate building age data to reduce energy consumption and mitigate climate change. In this work, we introduce MapYourCity, a novel multi-modal benchmark dataset comprising top-view Very High Resolution (VHR) imagery, multi-spectral Earth Observation (EO) data from the Copernicus Sentinel-2 satellite constellation, and co-localized street-view images across various European cities. Each building is labeled with its construction epoch, and the task is formulated as a seven-class classification problem covering periods from 1900 to the present. To advance research in EO generalization and multi-modal learning, we organized a community-driven data challenge in 2024, hosted by ESA Φ-lab, which ran for four months and attracted wide participation. This paper presents the Top-4 performing models from the challenge and their evaluation results. We assess model generalization on cities excluded from training to prevent data leakage, and evaluate performance under missing modality scenarios, particularly when street-view data is unavailable. Results demonstrate that building age estimation is both feasible and effective, even in previously unseen cities and when relying solely on top-view satellite imagery (i.e. with VHR and Sentinel-2 images). The MapYourCity dataset thus provides a valuable resource for developing scalable, real-world solutions in sustainable urban analytics.
[149] Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation
Xiaomeng Zhu, Yuyang Li, Leiyao Cui, Pengfei Li, Huan-ang Gao, Yixin Zhu, Hao Zhao
Main category: cs.CV
TL;DR: LVIS-Aff dataset and Afford-X model improve object affordance reasoning with 12.1% performance gain over non-LLM methods while being compact and fast for robotic applications.
Details
Motivation: Current computational models for affordance reasoning lack generalizability, and LLMs are difficult to deploy on local devices for task-oriented manipulations.
Method: Created LVIS-Aff dataset with 1,496 tasks and 119k images, developed Afford-X model with Verb Attention and Bi-Fusion modules for multi-modal understanding.
Result: Achieved 12.1% performance improvement over non-LLM methods, 1.2% improvement over previous work, with 187M parameters and 50x faster inference than GPT-4V API.
Conclusion: Demonstrates potential for efficient, generalizable affordance reasoning models deployable on local devices for task-oriented robotic manipulations.
Abstract: Object affordance reasoning, the ability to infer object functionalities based on physical properties, is fundamental for task-oriented planning and activities in both humans and Artificial Intelligence (AI). This capability, required for planning and executing daily activities in a task-oriented manner, relies on commonsense knowledge of object physics and functionalities, extending beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, limiting their applicability in novel scenarios. Meanwhile, comprehensive Large Language Models (LLMs) with emerging reasoning capabilities are challenging to deploy on local devices for task-oriented manipulations. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. Utilizing this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. This model achieves up to a 12.1% performance improvement over the best-reported results from non-LLM methods, while also demonstrating a 1.2% enhancement compared to our previous conference paper. Additionally, it maintains a compact 187M parameter size and infers nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential for efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulations. We showcase Afford-X’s effectiveness in enabling task-oriented manipulations for robots across various tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.
[150] Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, Andrea Nascetti
Main category: cs.CV
TL;DR: SatDiFuser transforms diffusion-based generative models into discriminative foundation models for remote sensing, achieving state-of-the-art performance through multi-stage feature fusion.
Details
Motivation: Current geospatial foundation models primarily use contrastive learning or masked image modeling, while generative diffusion models remain underexplored for discriminative tasks despite their potential to capture multi-grained semantics essential for remote sensing.
Method: Developed SatDiFuser framework that systematically analyzes multi-stage, noise-dependent diffusion features and implements three fusion strategies to leverage diverse representations from diffusion-based generative models.
Result: Outperforms state-of-the-art GFMs with gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification on remote sensing benchmarks.
Conclusion: Demonstrates that diffusion-based generative foundation models can rival or exceed discriminative GFMs, opening new possibilities for using generative models in discriminative remote sensing applications.
Abstract: Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily employ objectives like contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models, which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation, remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. The source code is available at: https://github.com/yurujaja/SatDiFuser.
[151] SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva
Main category: cs.CV
TL;DR: SPECS is a new efficient reference-free metric for long image caption evaluation that matches LLM-based metrics in human correlation while being much faster, making it practical for iterative model development.
Details
Motivation: Existing evaluation metrics for long image captions are unreliable: n-gram metrics lack semantic understanding, representational similarity metrics have low human correlation, and LLM-based metrics are too expensive for iterative use.
Method: SPECS modifies CLIP with a new objective that emphasizes specificity, rewarding correct details and penalizing incorrect ones in a reference-free representational similarity approach.
Result: SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments while being far more computationally efficient.
Conclusion: SPECS provides a practical and efficient alternative for iterative checkpoint evaluation during image captioning model development, balancing accuracy with computational feasibility.
Abstract: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation to human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
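SPECS is a CLIP-based RS metric; the sketch below shows only the underlying reference-free CLIP similarity using the standard Hugging Face CLIP API, without the specificity-tuned weights that distinguish SPECS.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Base reference-free CLIP similarity; SPECS additionally fine-tunes CLIP
# with a specificity objective, which this sketch does not reproduce.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, caption):
    """Cosine similarity between a PIL image and a caption. Note that long
    captions exceed CLIP's 77-token window and get truncated here."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```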
[152] Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving
Runwei Guan, Jianan Liu, Ningwei Ouyang, Shaofeng Liang, Daizong Liu, Xiaolou Sun, Lianqing Zheng, Ming Xu, Yutao Yue, Guoqiang Mao, Hui Xiong
Main category: cs.CV
TL;DR: TPCNet is the first outdoor 3D visual grounding model that combines LiDAR and radar sensors using prompt-guided fusion for more accurate 3D scene understanding in autonomous driving.
Details
Motivation: Existing 3D understanding relies on 2D Vision-Language Models with limited scene context, while point cloud sensors (LiDAR) and 4D radar provide richer 3D representations and motion information that can better support natural language queries for visual grounding.
Method: Proposes TPCNet with Two-Stage Heterogeneous Modal Adaptive Fusion: 1) Bidirectional Agent Cross-Attention (BACA) for querying both sensor features with global receptive fields, 2) Dynamic Gated Graph Fusion (DGGF) to locate regions of interest, and 3) C3D-RECHead based on nearest object edge to ego-vehicle for enhanced accuracy.
Result: Achieves state-of-the-art performance on both Talk2Radar and Talk2Car datasets, demonstrating superior 3D visual grounding capabilities.
Conclusion: The integration of LiDAR and radar sensors through prompt-guided fusion enables more accurate and flexible 3D visual grounding for autonomous driving applications, with TPCNet setting new benchmarks in outdoor scene understanding.
Abstract: Embodied outdoor scene understanding forms the foundation for autonomous agents to perceive, analyze, and react to dynamic driving environments. However, existing 3D understanding is predominantly based on 2D Vision-Language Models (VLMs), which collect and process limited scene-aware contexts. In contrast, compared to the 2D planar visual information, point cloud sensors such as LiDAR provide rich depth and fine-grained 3D representations of objects. Even better, the emerging 4D millimeter-wave radar detects the motion trend, velocity, and reflection intensity of each object. The integration of these two modalities provides more flexible querying conditions for natural language, thereby supporting more accurate 3D visual grounding. To this end, we propose a novel method called TPCNet, the first outdoor 3D visual grounding model upon the paradigm of prompt-guided point cloud sensor combination, including both LiDAR and radar sensors. To optimally combine the features of these two sensors required by the prompt, we design a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion. Specifically, this paradigm initially employs Bidirectional Agent Cross-Attention (BACA), which feeds both-sensor features, characterized by global receptive fields, to the text features for querying. Moreover, we design a Dynamic Gated Graph Fusion (DGGF) module to locate the regions of interest identified by the queries. To further enhance accuracy, we devise a C3D-RECHead, based on the nearest object edge to the ego-vehicle. Experimental results demonstrate that our TPCNet, along with its individual modules, achieves the state-of-the-art performance on both the Talk2Radar and Talk2Car datasets. We release the code at https://github.com/GuanRunwei/TPCNet.
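A rough sketch, in the spirit of BACA, of text features querying LiDAR and radar feature sets through two cross-attention branches whose responses are fused. The actual agent-based attention and fusion design in TPCNet is more elaborate; everything below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Hypothetical two-branch cross-attention: text queries attend to
    LiDAR and radar features, and the two responses are fused."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_radar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text, lidar, radar):  # (B, L, d), (B, N, d), (B, M, d)
        a, _ = self.attn_lidar(text, lidar, lidar)   # text queries LiDAR
        b, _ = self.attn_radar(text, radar, radar)   # text queries radar
        return self.fuse(torch.cat([a, b], dim=-1))  # (B, L, d)
```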
[153] JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
Main category: cs.CV
TL;DR: A novel post-training approach for Visual Language Action models that enhances world knowledge, visual recognition, and spatial grounding through self-supervised visual and linguistic guidance, achieving 40% improvement over baselines in Minecraft tasks.
Details
Motivation: Previous work focused on action post-training while neglecting enhancements to the foundational Visual Language Models themselves, limiting their capabilities in open-world decision-making tasks.
Method: Act from Visual Language Post-Training approach that refines VLMs through visual and linguistic guidance in a self-supervised manner, using non-trajectory tasks for post-training.
Result: Achieved first VLA models in Minecraft capable of following human instructions on over 1,000 atomic tasks (crafting, smelting, cooking, mining, killing) with 40% improvement over best agent baseline, surpassing traditional imitation learning-based policies.
Conclusion: The approach demonstrates state-of-the-art performance in Minecraft, showing that post-training on non-trajectory tasks significantly enhances VLA model capabilities for open-world decision-making, with code/models/datasets open-sourced for further research.
Abstract: Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models’ capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found in https://craftjarvis.github.io/JarvisVLA.
[154] Dynamic Motion Blending for Versatile Motion Editing
Nan Jiang, Hongjie Li, Ziye Yuan, Zimo He, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang
Main category: cs.CV
TL;DR: MotionReFit is a text-guided motion editing method that uses MotionCutMix for data augmentation and an auto-regressive diffusion model with motion coordinator to handle diverse editing scenarios without relying on pre-collected triplets or LLMs.
Details
Motivation: Existing text-guided motion editing methods are limited by pre-collected training triplets, which restricts their versatility in diverse editing scenarios. The need for more flexible and comprehensive training data motivates this work.
Method: Proposes MotionCutMix for online data augmentation by blending body part motions based on text, and MotionReFit, an auto-regressive diffusion model with a motion coordinator to handle the increased randomness from compositional data generation.
Result: Achieves state-of-the-art performance in text-guided motion editing through extensive experiments, handling both spatial and temporal edits directly from human instructions without additional specifications.
Conclusion: The combination of MotionCutMix data augmentation and MotionReFit diffusion model effectively addresses the limitations of existing methods, enabling versatile text-guided motion editing without relying on pre-collected triplets or large language models.
Abstract: Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely hinders their versatility in diverse editing scenarios. We introduce MotionCutMix, an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. While MotionCutMix effectively expands the training distribution, the compositional nature introduces increased randomness and potential body part incoordination. To model such a rich distribution, we present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture facilitates learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. Our method handles both spatial and temporal motion edits directly from high-level human instructions, without relying on additional specifications or Large Language Models. Through extensive experiments, we show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.
[155] GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
Jieming Cui, Tengyu Liu, Ziyu Meng, Jiale Yu, Ran Song, Wei Zhang, Yixin Zhu, Siyuan Huang
Main category: cs.CV
TL;DR: GROVE is a generalized reward framework that uses LLMs and VLMs for open-vocabulary physical skill learning without manual rewards or demonstrations, achieving better performance and faster training.
Details
Motivation: Current reinforcement learning approaches have limitations: manual rewards lack scalability across diverse tasks, and demonstration-based methods struggle to generalize beyond training distribution.
Method: Uses LLMs to generate precise physical constraints and VLMs to evaluate motion semantics. Includes iterative refinement process and Pose2CLIP mapper to project agent poses directly into semantic feature space without expensive rendering.
Result: Achieved 22.2% higher motion naturalness, 25.7% better task completion scores, and trained 8.4x faster than previous methods across diverse embodiments and learning paradigms.
Conclusion: Establishes a new foundation for scalable physical skill acquisition in simulated environments through complementary LLM and VLM guidance with efficient pose-to-feature mapping.
Abstract: Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that Large Language Models (LLMs) and Vision Language Models (VLMs) provide complementary guidance – LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE’s effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4x faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.
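A minimal sketch of what a Pose2CLIP-style mapper could look like: a small MLP from a flattened pose vector into CLIP's embedding space, trained elsewhere to match CLIP features of rendered frames so that rendering can be skipped at reward-evaluation time. Dimensions and depth are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pose2CLIP(nn.Module):
    """Hypothetical lightweight pose-to-CLIP-space mapper."""
    def __init__(self, pose_dim, clip_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, clip_dim))

    def forward(self, pose):                       # (B, pose_dim)
        return F.normalize(self.net(pose), dim=-1)  # unit-norm CLIP-space feature
```

Once trained, the VLM-side reward reduces to a cosine similarity between this feature and the CLIP text embedding of the skill description, which is far cheaper than rendering every simulation step.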
[156] MedM-VL: What Makes a Good Medical LVLM?
Yiming Shi, Shaoshuai Yang, Xun Zhu, Haoyu Wang, Xiangling Fu, Miao Li, Ji Wu
Main category: cs.CV
TL;DR: This paper presents MedM-VL, a modular framework for building medical vision-language models based on LLaVA, with specialized models for 2D and 3D medical image analysis tasks.
Details
Motivation: Traditional task-specific models struggle with complex medical multimodal tasks like report generation and visual question answering, while large vision-language models offer promising solutions that need systematic exploration.
Method: Built on LLaVA framework, systematically explored model architectures and training strategies for both 2D and 3D medical LVLMs, releasing a modular codebase and two pre-trained models.
Result: Developed MedM-VL framework with MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications, providing extensive empirical findings and practical guidance.
Conclusion: The study offers a comprehensive approach to medical LVLMs with reproducible code and pre-trained models to support future research in medical multimodal tasks.
Abstract: Medical image analysis is essential in modern healthcare. Deep learning has redirected research focus toward complex medical multimodal tasks, including report generation and visual question answering. Traditional task-specific models often fall short in handling these challenges. Large vision-language models (LVLMs) offer new solutions for solving such tasks. In this study, we build on the popular LLaVA framework to systematically explore model architectures and training strategies for both 2D and 3D medical LVLMs. We present extensive empirical findings and practical guidance. To support reproducibility and future research, we release a modular codebase, MedM-VL, and two pre-trained models: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code is available at: https://github.com/MSIIP/MedM-VL
[157] Just Say the Word: Annotation-Free Fine-Grained Object Counting
Adriano D’Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh
Main category: cs.CV
TL;DR: A method to improve fine-grained object counting by tuning concept embeddings using synthetic images from diffusion models, refining overcounts without real images or human annotations.
Details
Motivation: Fine-grained object counting is challenging for class-agnostic models which often overcount visually similar instances. Traditional approaches require time-consuming annotation and retraining without guaranteeing generalization to novel categories.
Method: Propose tuning compact concept embeddings derived from category prompts using synthetic images and pseudo-labels generated by text-to-image diffusion models. These embeddings condition a specialization module that refines raw overcounts from frozen counters.
Result: Validated on Lookalikes benchmark (1,037 images across 27 fine-grained subcategories), showing substantial improvements over strong baselines.
Conclusion: The approach provides accurate category-specific counting without requiring real images or human annotations, offering an efficient alternative to traditional retraining methods.
Abstract: Fine-grained object counting remains a major challenge for class-agnostic counting models, which overcount visually similar but incorrect instances (e.g., jalapeño vs. poblano). Addressing this by annotating new data and fully retraining the model is time-consuming and does not guarantee generalization to additional novel categories at test time. Instead, we propose an alternative paradigm: Given a category name, tune a compact concept embedding derived from the prompt using synthetic images and pseudo-labels generated by a text-to-image diffusion model. This embedding conditions a specialization module that refines raw overcounts from any frozen counter into accurate, category-specific estimates, without requiring real images or human annotations. We validate our approach on Lookalikes, a challenging new benchmark containing 1,037 images across 27 fine-grained subcategories, and show substantial improvements over strong baselines. Code will be released upon acceptance. Dataset - https://dalessandro.dev/datasets/lookalikes/
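As a rough illustration of the specialization idea, the hypothetical module below refines a frozen counter's raw count with a correction predicted from a tuned concept embedding; all names, shapes, and the multiplicative form are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class SpecializationModule(nn.Module):
    """Hypothetical refiner: scales a raw overcount down toward an accurate,
    category-specific estimate, conditioned on a concept embedding."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim + 1, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, raw_count: torch.Tensor, concept_emb: torch.Tensor):
        x = torch.cat([concept_emb, raw_count.unsqueeze(-1)], dim=-1)
        scale = torch.sigmoid(self.head(x)).squeeze(-1)  # correction in (0, 1)
        return raw_count * scale

# Training would regress the refined counts onto diffusion-generated
# pseudo-labels, with no real images or human annotations involved.
refined = SpecializationModule()(torch.tensor([12.0, 7.0]), torch.randn(2, 512))
print(refined.shape)  # torch.Size([2])
```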
[158] Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation
Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen
Main category: cs.CV
TL;DR: A novel prompt optimization framework that uses large vision language models (LVLMs) to rewrite simple user prompts into sophisticated ones and provide AI feedback for reinforcement learning-based self-improvement.
Details
Motivation: Text-to-image models require specialized vocabulary for effective prompts, and existing methods depend on large amounts of manual annotations and trained aesthetic assessment models, which introduces biases and data dependency issues.
Method: Uses LVLMs as both solver (to rewrite prompts) and reward model (to score image aesthetics and alignment). Employs reinforcement learning with AI feedback instead of human feedback, unifying solver and reward model into one model for iterative self-improvement.
Result: Outperforms other strong competitors on two popular datasets.
Conclusion: The proposed framework effectively optimizes text prompts for text-to-image models using LVLMs with AI feedback, reducing dependence on manual annotations and trained models while achieving superior performance.
Abstract: Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
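The solve-then-judge cycle can be sketched as below; `StubLVLM` and `StubT2I` are invented placeholders for the LVLM (acting as both solver and reward model) and the text-to-image model, and the real method optimizes the LVLM with reinforcement learning rather than this greedy loop.

```python
class StubLVLM:
    def rewrite(self, prompt: str) -> str:          # LVLM as solver
        return prompt + ", highly detailed, cinematic lighting"

    def score(self, image, prompt: str) -> float:   # same LVLM as judge
        return float(len(image))                    # placeholder AI-feedback reward

class StubT2I:
    def generate(self, prompt: str) -> str:
        return f"<image for: {prompt}>"             # placeholder image handle

def self_improve(user_prompt: str, lvlm, t2i, rounds: int = 3) -> str:
    best_prompt, best_reward = user_prompt, float("-inf")
    for _ in range(rounds):
        candidate = lvlm.rewrite(best_prompt)                      # propose a rewrite
        reward = lvlm.score(t2i.generate(candidate), user_prompt)  # judge it
        if reward > best_reward:
            best_prompt, best_reward = candidate, reward
    return best_prompt

print(self_improve("a cat on a chair", StubLVLM(), StubT2I()))
```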
[159] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment
Edoardo Bianchi, Antonio Liotta
Main category: cs.CV
TL;DR: PATS is a novel temporal sampling method that preserves complete fundamental movements in continuous segments for automated sports skill assessment, outperforming state-of-the-art methods across various sports domains.
Details
Motivation: Current video sampling methods disrupt temporal continuity essential for proficiency evaluation in sports skill assessment, making it difficult to capture fundamental movement patterns that distinguish expert from novice performance.
Method: Proficiency-Aware Temporal Sampling (PATS) adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence.
Result: PATS surpasses state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) on EgoExo4D benchmark with SkillFormer, with substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball).
Conclusion: PATS successfully adapts to diverse activity characteristics and demonstrates effectiveness as an adaptive temporal sampling approach that advances automated skill assessment for real-world applications.
Abstract: Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.
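For intuition, a segment-preserving sampler in the spirit of PATS might look like the sketch below: rather than uniformly striding frames (which breaks movement continuity), it takes contiguous windows from several parts of the clip. The window and segment counts are illustrative; the actual PATS segmentation is proficiency-aware rather than fixed.

```python
def continuous_segment_sample(num_frames: int, num_segments: int, window: int):
    """Return frame indices as `num_segments` contiguous windows of `window`
    frames, spread across the clip to maximize coverage."""
    indices = []
    stride = max((num_frames - window) // max(num_segments - 1, 1), 1)
    for s in range(num_segments):
        start = min(s * stride, max(num_frames - window, 0))
        indices.extend(range(start, start + window))
    return indices

# Four uninterrupted 16-frame windows from a 300-frame clip.
print(continuous_segment_sample(300, num_segments=4, window=16))
```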
[160] Earth Observation Foundation Model PhilEO: Pretraining on the MajorTOM and FastTOM Datasets
Nikolaos Dionelis, Jente Bosmans, Riccardo Musto, Giancarlo Paoletti, Simone Sarti, Giacomo Cascarano, Casper Fibaek, Luke Camilleri, Bertrand Le Saux, Nicolas Longépé
Main category: cs.CV
TL;DR: Scaling up PhilEO foundation model on 23TB MajorTOM dataset improves performance on Earth observation tasks like road/building density estimation and land cover segmentation, with Vision Transformers outperforming CNNs.
Details
Motivation: Earth Observation satellites generate massive data volumes (1.6TB/day from Sentinel-2 alone), requiring foundation models pretrained on large unlabeled datasets to enable efficient fine-tuning for multiple downstream tasks with minimal labeled data.
Method: Developed various PhilEO model variants with different parameter counts and architectures (U-Net CNNs to Vision Transformers), pretrained on 23TB MajorTOM dataset and 2TB FastTOM subset, then fine-tuned on PhilEO Bench for road density estimation, building density regression, and land cover segmentation.
Result: PhilEO 44M MajorTOM 23TB outperformed PhilEO Globe 0.5TB 44M for all n-shots in road density regression. PhilEO 200M FastTOM outperformed all other models for most n-shots in road and building density estimation. Both dataset and model scaling showed effectiveness.
Conclusion: The study validates the effectiveness of scaling both datasets and models for Earth Observation foundation models, with Vision Transformers demonstrating superior performance over traditional CNN architectures like U-Net.
Abstract: Today, Earth Observation (EO) satellites generate massive volumes of data, with the Copernicus Sentinel-2 constellation alone producing approximately 1.6TB per day. To fully exploit this information, it is essential to pretrain EO Foundation Models (FMs) on large unlabeled datasets, enabling efficient fine-tuning for several different downstream tasks with minimal labeled data. In this work, we present the scaling-up of our recently proposed EO Foundation Model, PhilEO Geo-Aware U-Net, on the unlabeled 23TB dataset MajorTOM, which covers the vast majority of the Earth’s surface, as well as on the specialized subset FastTOM 2TB that does not include oceans and ice. We develop and study various PhilEO model variants with different numbers of parameters and architectures. We fine-tune the models on the PhilEO Bench for road density estimation, building density pixel-wise regression, and land cover semantic segmentation, and we evaluate the performance. Our results demonstrate that for all n-shots for road density regression, the PhilEO 44M MajorTOM 23TB model outperforms PhilEO Globe 0.5TB 44M. We also show that for most n-shots for road density estimation and building density regression, PhilEO 200M FastTOM outperforms all the other models. The effectiveness of both dataset and model scaling is validated using the PhilEO Bench. We also study the impact of architecture scaling, transitioning from U-Net Convolutional Neural Networks (CNN) to Vision Transformers (ViT).
[161] Survivability of Backdoor Attacks on Unconstrained Face Recognition Systems
Quentin Le Roux, Yannick Teglia, Teddy Furon, Philippe Loubet-Moundi, Eric Bourbao
Main category: cs.CV
TL;DR: First comprehensive system-level analysis of backdoor attacks on face recognition systems, showing feature extractors are vulnerable and single backdoors can compromise entire pipelines.
Details
Motivation: Deep learning face recognition systems are widely deployed but their security vulnerabilities, particularly backdoor attacks on real-world unconstrained pipelines, remain underexplored.
Method: Analyzed 20 pipeline configurations and 15 attack scenarios to demonstrate vulnerabilities in face feature extractors trained with large margin metric learning losses.
Result: Showed that face feature extractors are susceptible to backdoor attacks and a single backdoor can compromise an entire face recognition system.
Conclusion: Proposed effective best practices and countermeasures for stakeholders to protect against backdoor attacks in face recognition systems.
Abstract: The widespread deployment of Deep Learning-based Face Recognition Systems raises multiple security concerns. While prior research has identified backdoor vulnerabilities on isolated components, Backdoor Attacks on real-world, unconstrained pipelines remain underexplored. This paper presents the first comprehensive system-level analysis of Backdoor Attacks targeting Face Recognition Systems and provides three contributions. We first show that face feature extractors trained with large margin metric learning losses are susceptible to Backdoor Attacks. By analyzing 20 pipeline configurations and 15 attack scenarios, we then reveal that a single backdoor can compromise an entire Face Recognition System. Finally, we propose effective best practices and countermeasures for stakeholders.
[162] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image
Pufan Li, Bi’an Du, Wei Hu
Main category: cs.CV
TL;DR: A novel method that integrates geometry and perception priors to generate detailed 3D objects from single images, using Gaussian branches initialization and stable Score Distillation Sampling for improved multiview consistency and geometric detail.
Details
Motivation: Existing approaches for 3D object generation from single-view images suffer from poor multiview consistency and lack geometric detail, often relying on fine-tuning 2D diffusion models or direct 3D generation methods.
Method: Integrates geometry and perception priors to initialize Gaussian branches and guide parameter optimization. Uses geometry prior for rough 3D shapes and perception prior from 2D diffusion models for multiview information. Employs stable Score Distillation Sampling for fine-grained prior distillation and reprojection-based strategy for depth consistency.
Result: Outperforms existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation with improved multiview consistency and geometric detail.
Conclusion: The proposed method successfully addresses the limitations of existing approaches by seamlessly integrating geometry and perception information without additional model training, achieving superior performance in generating realistic 3D objects from single images.
Abstract: Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception information without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we incorporate geometry and perception priors to initialize the Gaussian branches and guide their parameter optimization. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we introduce a stable Score Distillation Sampling for fine-grained prior distillation to ensure effective knowledge transfer. The model is further enhanced by a reprojection-based strategy that enforces depth consistency. Experimental results show that we outperform existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.
[163] AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models
Yuxiang Xiao, Yang Hu, Bin Li, Tianyang Zhang, Zexi Li, Huazhu Fu, Jens Rittscher, Kaixiang Yang
Main category: cs.CV
TL;DR: AdaFusion is a prompt-guided inference framework that dynamically integrates multiple pathology foundation models to overcome latent biases and improve generalizability in histopathology applications.
Details
Motivation: Pathology foundation models have strong representational capabilities but suffer from opaque pretraining contexts that introduce latent biases, hindering generalizability and transparency in downstream applications.
Method: The proposed AdaFusion framework compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context.
Result: AdaFusion consistently surpassed individual PFMs across three real-world benchmarks spanning treatment response prediction, tumour grading, and spatial gene expression inference, for both classification and regression tasks.
Conclusion: AdaFusion effectively bridges heterogeneous pathology foundation models, achieving enhanced performance while providing interpretable insights into each model’s biosemantic specialization.
Abstract: Pathology foundation models (PFMs) have demonstrated strong representational capabilities through self-supervised pre-training on large-scale, unannotated histopathology image datasets. However, their diverse yet opaque pretraining contexts, shaped by both data-related and structural/training factors, introduce latent biases that hinder generalisability and transparency in downstream applications. In this paper, we propose AdaFusion, a novel prompt-guided inference framework that, to our knowledge, is among the very first to dynamically integrate complementary knowledge from multiple PFMs. Our method compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. We evaluate AdaFusion on three real-world benchmarks spanning treatment response prediction, tumour grading, and spatial gene expression inference. Our approach consistently surpasses individual PFMs across both classification and regression tasks, while offering interpretable insights into each model’s biosemantic specialisation. These results highlight AdaFusion’s ability to bridge heterogeneous PFMs, achieving both enhanced performance and interpretability of model-specific inductive biases.
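A minimal sketch of attention-weighted fusion across heterogeneous PFM features is shown below, assuming each model's tile features are first projected into a shared space; the layer sizes and single-score attention are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dims, shared_dim: int = 256):
        super().__init__()
        # One projection per foundation model, mapping into a shared space.
        self.proj = nn.ModuleList([nn.Linear(d, shared_dim) for d in dims])
        self.attn = nn.Linear(shared_dim, 1)  # one relevance score per model

    def forward(self, feats):  # feats: list of (batch, dim_i) tile features
        z = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        w = torch.softmax(self.attn(z), dim=1)  # (batch, n_models, 1)
        return (w * z).sum(dim=1)               # fused (batch, shared_dim)

fused = AttentionFusion([768, 1024, 1536])(
    [torch.randn(4, 768), torch.randn(4, 1024), torch.randn(4, 1536)])
print(fused.shape)  # torch.Size([4, 256])
```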
[164] Bridging the Gap: A Framework for Real-World Video Deepfake Detection via Social Network Compression Emulation
Andrea Montibeller, Dasara Shullani, Daniele Baracchi, Alessandro Piva, Giulia Boato
Main category: cs.CV
TL;DR: Proposes a framework to emulate social media video compression for better deepfake detector training, achieving comparable performance to real shared media training.
Details
Motivation: AI-generated videos on social networks challenge deepfake detection, as lab-trained detectors fail in real-world conditions due to platform-specific compression that removes forensic cues.
Method: Estimates compression and resizing parameters from uploaded videos to create a local emulator that reproduces platform-specific artifacts without API access.
Result: Emulated data closely matches real upload degradation patterns, and detectors fine-tuned on emulated videos perform comparably to those trained on actual shared media.
Conclusion: Provides scalable solution for bridging lab-based training and real-world deployment of deepfake detectors, especially for compressed video content.
Abstract: The growing presence of AI-generated videos on social networks poses new challenges for deepfake detection, as detectors trained under controlled conditions often fail to generalize to real-world scenarios. A key factor behind this gap is the aggressive, proprietary compression applied by platforms like YouTube and Facebook, which launder low-level forensic cues. However, replicating these transformations at scale is difficult due to API limitations and data-sharing constraints. For these reasons, we propose a first framework that emulates the video sharing pipelines of social networks by estimating compression and resizing parameters from a small set of uploaded videos. These parameters enable a local emulator capable of reproducing platform-specific artifacts on large datasets without direct API access. Experiments on FaceForensics++ videos shared via social networks demonstrate that our emulated data closely matches the degradation patterns of real uploads. Furthermore, detectors fine-tuned on emulated videos achieve comparable performance to those trained on actual shared media. Our approach offers a scalable and practical solution for bridging the gap between lab-based training and real-world deployment of deepfake detectors, particularly in the underexplored domain of compressed video content.
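Once per-platform parameters are estimated, applying them offline is straightforward; the sketch below re-encodes a video with ffmpeg using placeholder CRF and resolution values (the paper estimates these from real uploads, and its exact pipeline is not reproduced here).

```python
import subprocess

def emulate_platform(src: str, dst: str, crf: int = 28, height: int = 480) -> None:
    """Degrade `src` with a platform-like resize and lossy H.264 re-encode."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",          # downscale, keep aspect ratio
        "-c:v", "libx264", "-crf", str(crf),  # lossy re-encode
        dst,
    ], check=True)

# Example: emulate_platform("clip.mp4", "clip_emulated.mp4")
```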
[165] DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions
Aykut Sirma, Angelos Plastropoulos, Gilbert Tang, Argyrios Zolotas
Main category: cs.CV
TL;DR: DRespNeT is a high-resolution aerial dataset for post-earthquake instance segmentation with detailed annotations of 28 critical classes, enabling real-time detection of structural damage, access points, and obstacles to improve search-and-rescue operations.
Details
Motivation: Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue operations after earthquakes, but existing datasets rely on satellite imagery or coarse semantic labeling.
Method: Developed DRespNeT dataset with detailed polygon-level instance segmentation annotations from 1080p aerial footage of disaster zones, including 28 operationally critical classes. Used YOLO-based instance segmentation models (YOLOv8-seg) for evaluation.
Result: Optimized YOLOv8-DRN model achieved 92.7% mAP50 with 27 FPS inference speed on RTX-4090 GPU, meeting real-time operational requirements for multi-target detection.
Conclusion: DRespNeT dataset and models significantly enhance real-time situational awareness, support SAR teams and robotic systems, and improve human-robot collaboration for emergency response and survivor outcomes.
Abstract: Recent advancements in computer vision and deep learning have enhanced disaster-response capabilities, particularly in the rapid assessment of earthquake-affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue (SAR) operations. To address this need, we introduce DRespNeT, a high-resolution dataset specifically developed for aerial instance segmentation of post-earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon-level instance segmentation annotations derived from high-definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine-grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO-based instance segmentation models, specifically YOLOv8-seg, demonstrate significant gains in real-time situational awareness and decision-making. Our optimized YOLOv8-DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX-4090 GPU for multi-target detection, meeting real-time operational requirements. The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human-robot collaboration, streamlining emergency response, and improving survivor outcomes.
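For readers who want to reproduce the evaluation setup, YOLOv8 segmentation models are typically run through the ultralytics API as sketched below; the checkpoint name and image path are placeholders, since the fine-tuned YOLOv8-DRN weights are not assumed here.

```python
from ultralytics import YOLO

model = YOLO("yolov8x-seg.pt")        # base segmentation checkpoint (placeholder)
results = model("aerial_frame.jpg")   # inference on one aerial image (placeholder)

for r in results:
    # Class ids of detected instances and whether masks were produced.
    print(r.boxes.cls, r.masks is not None)
```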
[166] HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection
Harris Song, Tuan-Anh Vu, Sanjith Menon, Sriram Narasimhan, M. Khalid Jawed
Main category: cs.CV
TL;DR: HiddenObject is a Mamba-based fusion framework that integrates RGB, thermal, and depth data to detect hidden/occluded objects, outperforming traditional unimodal and naive fusion methods.
Details
Motivation: Traditional RGB-based detection methods fail under adverse conditions like occlusion, camouflage, and lighting variations, creating a need for robust modality-agnostic approaches.
Method: Integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism that captures complementary signals across modalities and fuses them into a unified representation.
Result: Achieves state-of-the-art or competitive performance across multiple benchmark datasets, demonstrating enhanced detection of obscured/camouflaged targets.
Conclusion: Mamba-based fusion architectures significantly advance multimodal object detection, especially under visually degraded or complex conditions, exposing limitations of current unimodal/fusion strategies.
Abstract: Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naïve fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.
[167] Backdoor Poisoning Attack Against Face Spoofing Attack Detection Methods
Shota Iwamatsu, Koichi Ito, Takafumi Aoki
Main category: cs.CV
TL;DR: Proposes a backdoor poisoning attack method that allows spoofing attacks to bypass face anti-spoofing detection by embedding spoofing features into live images without visual changes.
Details
Motivation: Face recognition systems are vulnerable to spoofing attacks using photos, and existing deep learning-based detection methods can be compromised if malicious data is injected into training datasets.
Method: Embed features extracted from spoofing attack face images into live face images without causing perceptible visual alterations, creating poisoned training data.
Result: Experiments on public datasets show the method successfully enables spoofing attacks to bypass detection, demonstrating a realistic threat to existing systems.
Conclusion: The proposed backdoor poisoning attack method reveals a latent security threat in face anti-spoofing detection systems that can be exploited to bypass protection mechanisms.
Abstract: Face recognition systems are robust against environmental changes and noise, and thus may be vulnerable to illegal authentication attempts using user face photos, such as spoofing attacks. To prevent such spoofing attacks, it is crucial to discriminate whether the input image is a live user image or a spoofed image prior to the face recognition process. Most existing spoofing attack detection methods utilize deep learning, which necessitates a substantial amount of training data. Consequently, if malicious data is injected into a portion of the training dataset, a specific spoofing attack may be erroneously classified as live, leading to false positives. In this paper, we propose a novel backdoor poisoning attack method to demonstrate the latent threat of backdoor poisoning within face anti-spoofing detection. The proposed method enables certain spoofing attacks to bypass detection by embedding features extracted from the spoofing attack’s face image into a live face image without inducing any perceptible visual alterations. Through experiments conducted on public datasets, we demonstrate that the proposed method constitutes a realistic threat to existing spoofing attack detection systems.
[168] OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
Main category: cs.CV
TL;DR: OneCAT is a unified multimodal model using pure decoder-only transformer architecture that integrates understanding, generation, and editing without external vision components, achieving state-of-the-art performance with efficiency gains.
Details
Motivation: To create a more efficient and unified multimodal model that eliminates the need for external vision components like ViT or vision tokenizers during inference, particularly for high-resolution inputs, while maintaining top performance.
Method: Uses a modality-specific Mixture-of-Experts (MoE) structure trained with single autoregressive objective, incorporates multi-scale visual autoregressive mechanism within LLM, and supports dynamic resolutions natively.
Result: Significantly outperforms existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding, with drastic reduction in decoding steps compared to diffusion methods.
Conclusion: Pure autoregressive modeling serves as a sufficient and elegant foundation for unified multimodal intelligence, demonstrating powerful potential for efficient multimodal integration.
Abstract: We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
[169] ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection
Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Main category: cs.CV
TL;DR: Proposes ANTS - an adaptive negative textual space method using MLLMs to generate precise negative descriptions for both far and near OOD detection, achieving state-of-the-art performance on ImageNet with 4.2% FPR95 reduction.
Details
Motivation: Existing OOD detection methods using negative labels lack understanding of OOD images and suffer from false negatives that degrade near-OOD performance.
Method: Leverages MLLMs to generate expressive negative sentences from identified OOD images for far-OOD, and creates visually similar negative labels for ID classes to reduce false negatives in near-OOD. Uses adaptive weighted score to balance both approaches.
Result: Achieves 4.2% reduction in FPR95 on ImageNet benchmark, establishing new state-of-the-art. Method is training-free and zero-shot for high scalability.
Conclusion: ANTS effectively addresses both far and near OOD detection by leveraging MLLMs’ understanding capabilities to create adaptive negative textual spaces, demonstrating superior performance without task-specific prior knowledge.
Abstract: The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
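One way to picture the adaptive weighting is the toy score below, which blends far- and near-negative evidence against ID similarity; the sigmoid weighting rule is an invented stand-in, not the paper's formula.

```python
import numpy as np

def ood_score(sim_id: np.ndarray, sim_far_neg: np.ndarray,
              sim_near_neg: np.ndarray) -> float:
    """Inputs: cosine similarities of a test image to ID labels, to far-OOD
    negative sentences, and to near-OOD negative labels. Higher output means
    more likely OOD."""
    far = sim_far_neg.max() - sim_id.max()
    near = sim_near_neg.max() - sim_id.max()
    # Lean on the near-OOD branch when the image looks similar to ID classes.
    w = 1.0 / (1.0 + np.exp(-10.0 * (sim_id.max() - 0.5)))
    return float(w * near + (1 - w) * far)

print(ood_score(np.array([0.6, 0.3]), np.array([0.7]), np.array([0.65])))
```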
[170] Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising
Yichao Liu, Hengzhi Xue, YueYang Teng
Main category: cs.CV
TL;DR: HSANet is a novel hybrid Swin attention network that effectively denoises low-dose CT/PET images while maintaining lightweight architecture for clinical deployment.
Details
Motivation: Low-dose CT and PET reduce radiation exposure but introduce noise and artifacts that compromise diagnostic accuracy, requiring effective denoising methods.
Method: Hybrid Swin Attention Network with Efficient Global Attention modules for spatial/channel interaction and hybrid upsampling module to prevent noise overfitting.
Result: HSANet achieves superior denoising performance compared to existing methods while maintaining lightweight model size suitable for standard GPU deployment.
Conclusion: The approach is highly practical for real-world clinical applications, providing effective denoising with computational efficiency.
Abstract: Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network’s capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.
[171] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures
Waqar Ahmad, Evan Murphy, Vladimir A. Krylov
Main category: cs.CV
TL;DR: Beta-SOD: A novel Beta mixture similarity-based outlier detection framework for robust object re-identification that handles label noise by modeling pairwise similarity distributions with identifiable Beta mixtures.
Details
Motivation: Object re-identification methods are highly sensitive to label noise, which causes significant performance degradation. Existing approaches struggle with noisy label scenarios in Re-ID tasks.
Method: Reframes Re-ID as supervised image similarity using Siamese networks with Beta-SOD framework - models cosine similarity distributions with two-component Beta mixture model. Combines binary cross-entropy, contrastive, and cosine embedding losses for joint optimization.
Result: Superior performance on person Re-ID (CUHK03, Market-1501) and vehicle Re-ID (VeRi-776) across 10-30% noise levels. Outperforms state-of-the-art methods in noisy scenarios.
Conclusion: Beta-SOD provides robust and broadly applicable solution for noisy Re-ID tasks, with proven identifiability of Beta mixtures ensuring well-posed learning framework.
Abstract: Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on CUHK03 and Market-1501 datasets, and vehicle Re-ID, on VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: github.com/waqar3411/Beta-SOD
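The core statistical step is easy to prototype: fit a two-component Beta mixture to pairwise cosine similarities and read off soft outlier memberships. The EM sketch below uses method-of-moments M-steps and a fixed initialization as simplifications of the paper's procedure.

```python
import numpy as np
from scipy.stats import beta

def fit_beta_mixture(cos_sims: np.ndarray, iters: int = 50):
    s = np.clip((cos_sims + 1) / 2, 1e-4, 1 - 1e-4)  # map [-1, 1] -> (0, 1)
    params = [(2.0, 5.0), (5.0, 2.0)]  # init: low-sim (noisy) vs high-sim (clean)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each pair.
        pdf = np.stack([beta.pdf(s, a, b) for a, b in params])
        r = pi[:, None] * pdf
        r /= r.sum(axis=0, keepdims=True)
        # M-step: method-of-moments update of each Beta component.
        params = []
        for k in range(2):
            w = r[k] / r[k].sum()
            m = (w * s).sum()
            v = max((w * (s - m) ** 2).sum(), 1e-6)
            c = m * (1 - m) / v - 1
            params.append((max(m * c, 1e-2), max((1 - m) * c, 1e-2)))
        pi = r.sum(axis=1) / len(s)
    return params, pi, r  # r[0] acts as a soft outlier score per pair

rng = np.random.default_rng(0)
sims = np.concatenate([rng.beta(8, 2, 900), rng.beta(2, 8, 100)]) * 2 - 1
params, pi, resp = fit_beta_mixture(sims)
print(pi)  # mixture weights; the noisy component should capture roughly 10%
```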
[172] SFD-Mamba2Net: Structure-Guided Frequency-Enhanced Dual-Stream Mamba2 Network for Coronary Artery Segmentation
Nan Mu, Ruiqi Song, Zhihui Xu, Jingfeng Jiang, Chen Zhao
Main category: cs.CV
TL;DR: SFD-Mamba2Net improves coronary artery segmentation and stenosis detection in ICA images using multi-scale structural priors, state-space modeling, and frequency-domain enhancement.
Details
Motivation: Coronary artery disease diagnosis requires precise vessel segmentation and stenosis detection from low-contrast, noisy ICA images, which existing methods struggle with due to complex vascular structures.
Method: End-to-end framework with CASE module for multi-scale structural enhancement in encoder, and PHFP module using wavelet decomposition for high-frequency detail refinement in decoder.
Result: Outperformed state-of-the-art methods across eight segmentation metrics and achieved highest true positive rate and positive predictive value in stenosis detection.
Conclusion: SFD-Mamba2Net effectively addresses challenges in ICA image analysis through integrated structural and frequency-domain approaches, demonstrating superior performance for clinical CAD diagnosis.
Abstract: Background: Coronary Artery Disease (CAD) is one of the leading causes of death worldwide. Invasive Coronary Angiography (ICA), regarded as the gold standard for CAD diagnosis, necessitates precise vessel segmentation and stenosis detection. However, ICA images are typically characterized by low contrast, high noise levels, and complex, fine-grained vascular structures, which pose significant challenges to the clinical adoption of existing segmentation and detection methods. Objective: This study aims to improve the accuracy of coronary artery segmentation and stenosis detection in ICA images by integrating multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement strategies. Methods: We propose SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection. In the encoder, a Curvature-Aware Structural Enhancement (CASE) module is embedded to leverage multi-scale responses for highlighting slender tubular vascular structures, suppressing background interference, and directing attention toward vascular regions. In the decoder, we introduce a Progressive High-Frequency Perception (PHFP) module that employs multi-level wavelet decomposition to progressively refine high-frequency details while integrating low-frequency global structures. Results and Conclusions: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection.
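The frequency-domain ingredient can be previewed with PyWavelets: a multi-level 2D wavelet decomposition separates an image into a low-frequency approximation and high-frequency detail bands, which is the raw material a PHFP-style module would refine (how the bands are fused back is the paper's contribution and not reproduced here).

```python
import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)  # stand-in for an ICA frame
coeffs = pywt.wavedec2(img, "haar", level=3)

low_freq = coeffs[0]          # coarse approximation: global vessel structure
high_freq_bands = coeffs[1:]  # (horizontal, vertical, diagonal) per level

print(low_freq.shape)
print([tuple(c.shape for c in level) for level in high_freq_bands])
```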
[173] Region-Wise Correspondence Prediction between Manga Line Art Images
Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui
Main category: cs.CV
TL;DR: A novel Transformer-based method for predicting region-wise correspondence between unlabeled manga line art images using patch-level similarity learning and edge-aware clustering.
Details
Motivation: Region-wise correspondence in manga line art is fundamental for applications like automatic colorization and frame generation, but remains unexplored in realistic scenarios without pre-existing segmentation or annotations.
Method: Divide line art images into patches, use Transformer-based framework to learn patch-level similarities, apply edge-aware clustering and region matching algorithm to convert patch predictions into coherent region-level correspondences.
Result: Achieves high patch-level accuracy (96.34%) and generates consistent region-level correspondences across multiple datasets.
Conclusion: The method demonstrates strong performance and practical potential for real-world manga applications without requiring pre-existing labels or masks.
Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
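As a simplified stand-in for the matching step, mutual nearest neighbours over patch embeddings already give a usable patch correspondence, as sketched below; the paper's edge-aware clustering and region-matching algorithm then lift such patch pairs to region level.

```python
import numpy as np

def mutual_nn_matches(emb_a: np.ndarray, emb_b: np.ndarray):
    """Return (i, j) patch pairs that are each other's nearest neighbour
    under cosine similarity."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                          # (patches_a, patches_b)
    nn_ab, nn_ba = sim.argmax(1), sim.argmax(0)
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

matches = mutual_nn_matches(np.random.rand(40, 64), np.random.rand(50, 64))
print(len(matches), "mutual patch matches")
```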
cs.AI
[174] Human-AI Collaboration Increases Efficiency in Regulatory Writing
Umut Eser, Yael Gozin, L. Jay Stallons, Ari Caroline, Martin Preusse, Brandon Rice, Scott Wright, Andrew Robertson
Main category: cs.AI
TL;DR: AutoIND LLM platform reduces IND application drafting time by ~97% while maintaining acceptable quality with no critical regulatory errors, though human experts are still needed for final refinement.
Details
Motivation: IND application preparation is time-intensive and expertise-dependent, slowing early clinical development. The study aims to evaluate if LLMs can accelerate this process while maintaining quality.
Method: Compared drafting times for AutoIND-generated IND summaries against estimated manual drafting times from experienced regulatory writers. Quality was assessed by a blinded expert using 7 criteria (correctness, completeness, conciseness, consistency, clarity, redundancy, emphasis), each scored 0-3.
Result: AutoIND reduced drafting time by ~97% (from ~100h to 3.7h and 2.6h for two IND applications). Quality scores were 69.6% and 77.9% with no critical regulatory errors, but deficiencies in emphasis, conciseness and clarity were noted.
Conclusion: AutoIND dramatically accelerates IND drafting, but expert regulatory writers remain essential for final refinement. The identified deficiencies provide a roadmap for targeted model improvements.
Abstract: Background: Investigational New Drug (IND) application preparation is time-intensive and expertise-dependent, slowing early clinical development. Objective: To evaluate whether a large language model (LLM) platform (AutoIND) can reduce first-draft composition time while maintaining document quality in regulatory submissions. Methods: Drafting times for IND nonclinical written summaries (eCTD modules 2.6.2, 2.6.4, 2.6.6) generated by AutoIND were directly recorded. For comparison, manual drafting times for IND summaries previously cleared by the U.S. FDA were estimated from the experience of regulatory writers (≥6 years) and used as industry-standard benchmarks. Quality was assessed by a blinded regulatory writing assessor using seven pre-specified categories: correctness, completeness, conciseness, consistency, clarity, redundancy, and emphasis. Each sub-criterion was scored 0-3 and normalized to a percentage. A critical regulatory error was defined as any misrepresentation or omission likely to alter regulatory interpretation (e.g., incorrect NOAEL, omission of mandatory GLP dose-formulation analysis). Results: AutoIND reduced initial drafting time by ~97% (from ~100 h to 3.7 h for 18,870 pages/61 reports in IND-1; and to 2.6 h for 11,425 pages/58 reports in IND-2). Quality scores were 69.6% and 77.9% for IND-1 and IND-2. No critical regulatory errors were detected, but deficiencies in emphasis, conciseness, and clarity were noted. Conclusions: AutoIND can dramatically accelerate IND drafting, but expert regulatory writers remain essential to mature outputs to submission-ready quality. Systematic deficiencies identified provide a roadmap for targeted model improvements.
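The quality normalization described in the Methods is simple to reproduce: seven criteria scored 0-3, summed, and divided by the maximum of 21. The scores below are illustrative, not taken from either IND.

```python
# Seven pre-specified criteria, each scored 0-3 (illustrative values).
scores = {"correctness": 3, "completeness": 2, "conciseness": 2,
          "consistency": 3, "clarity": 2, "redundancy": 2, "emphasis": 1}

quality = 100 * sum(scores.values()) / (3 * len(scores))
print(f"{quality:.1f}%")  # 71.4% for these scores
```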
[175] Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture
Aleksandr Boldachev
Main category: cs.AI
TL;DR: Boldsea is an architecture that uses executable ontologies to model complex dynamic systems, integrating event semantics with dataflow to overcome limitations of traditional BPM systems and object-oriented approaches.
Details
Motivation: To address the limitations of traditional Business Process Management systems and object-oriented semantic technologies by creating a more flexible and dynamic approach to modeling complex systems.
Method: Developed the BSL (boldsea Semantic Language) with formal BNF grammar and created the boldsea-engine architecture that directly interprets semantic models as executable algorithms without compilation.
Result: The system enables runtime modification of event models, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.
Conclusion: Boldsea provides an effective architecture for modeling complex dynamic systems through executable ontologies that overcome traditional BPM limitations and offer real-time adaptability.
Abstract: This paper presents boldsea, Boldachev’s semantic-event approach – an architecture for modeling complex dynamic systems using executable ontologies – semantic models that act as dynamic structures, directly controlling process execution. We demonstrate that integrating event semantics with a dataflow architecture addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies. The paper presents the formal BSL (boldsea Semantic Language), including its BNF grammar, and outlines the boldsea-engine’s architecture, which directly interprets semantic models as executable algorithms without compilation. It enables the modification of event models at runtime, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.
[176] Towards Fully Automated Molecular Simulations: Multi-Agent Framework for Simulation Setup and Force Field Extraction
Marko Petković, Vlado Menkovski, Sofía Calero
Main category: cs.AI
TL;DR: A multi-agent framework using LLM-based agents to autonomously perform porous materials characterization through automated simulation setup, force field selection, execution, and result interpretation.
Details
Motivation: Automated characterization of porous materials can accelerate materials discovery but is limited by the complexity of simulation setup and force field selection.
Method: Multi-agent system with LLM-based agents that autonomously understand characterization tasks, plan simulations, assemble force fields, execute RASPA simulations, and interpret results to guide subsequent steps.
Result: Initial evaluations demonstrate high correctness and reproducibility in literature-informed force field extraction and automated simulation setup.
Conclusion: This approach shows potential for enabling fully autonomous, scalable materials characterization to accelerate materials discovery.
Abstract: Automated characterization of porous materials has the potential to accelerate materials discovery, but it remains limited by the complexity of simulation setup and force field selection. We propose a multi-agent framework in which LLM-based agents can autonomously understand a characterization task, plan appropriate simulations, assemble relevant force fields, execute them and interpret their results to guide subsequent steps. As a first step toward this vision, we present a multi-agent system for literature-informed force field extraction and automated RASPA simulation setup. Initial evaluations demonstrate high correctness and reproducibility, highlighting this approach’s potential to enable fully autonomous, scalable materials characterization.
[177] How well can LLMs provide planning feedback in grounded environments?
Yuxuan Li, Victor Zhong
Main category: cs.AI
TL;DR: Foundation models (LLMs/VLMs) can provide diverse, high-quality planning feedback across domains, with larger reasoning models performing better, though performance degrades in complex continuous environments.
Details
Motivation: Reduce the need for carefully designed reward functions and high-quality demonstrations in grounded environment planning by leveraging pretrained foundation models' background knowledge.
Method: Evaluate LLMs and VLMs on various feedback types (binary, preference, action advising, goal advising, delta action) across symbolic, language, and continuous control environments using different inference methods (in-context learning, chain-of-thought, environment dynamics access).
Result: Foundation models provide high-quality feedback across domains; larger reasoning models offer more accurate feedback with less bias and benefit more from enhanced inference methods; feedback quality degrades in environments with complex dynamics or continuous state/action spaces.
Conclusion: Pretrained foundation models are effective for providing planning feedback, reducing reliance on reward engineering and demonstrations, though performance limitations exist in complex continuous environments.
Abstract: Learning to plan in grounded environments typically requires carefully designed reward functions or high-quality annotated demonstrations. Recent works show that pretrained foundation models, such as large language models (LLMs) and vision language models (VLMs), capture background knowledge helpful for planning, which reduces the amount of reward design and demonstrations needed for policy learning. We evaluate how well LLMs and VLMs provide feedback across symbolic, language, and continuous control environments. We consider prominent types of feedback for planning including binary feedback, preference feedback, action advising, goal advising, and delta action feedback. We also consider inference methods that impact feedback performance, including in-context learning, chain-of-thought, and access to environment dynamics. We find that foundation models can provide diverse high-quality feedback across domains. Moreover, larger and reasoning models consistently provide more accurate feedback, exhibit less bias, and benefit more from enhanced inference methods. Finally, feedback quality degrades for environments with complex dynamics or continuous state spaces and action spaces.
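The feedback taxonomy maps naturally onto prompt templates; the sketch below encodes the five evaluated feedback types as an enum with hypothetical wordings (the paper's exact prompts and model-querying plumbing are not reproduced).

```python
from enum import Enum

class FeedbackType(Enum):
    BINARY = "Is the proposed action correct? Answer yes or no."
    PREFERENCE = "Which candidate action is better, A or B?"
    ACTION_ADVISING = "What action should the agent take next?"
    GOAL_ADVISING = "What subgoal should the agent pursue?"
    DELTA_ACTION = "How should the proposed action be adjusted?"

def build_feedback_prompt(state: str, proposal: str, ftype: FeedbackType) -> str:
    return f"State: {state}\nProposed: {proposal}\n{ftype.value}"

print(build_feedback_prompt("key in room A, door locked",
                            "go to door", FeedbackType.BINARY))
```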
[178] A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes
Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra
Main category: cs.AI
TL;DR: A modular multimodal framework using generative AI to create realistic energy modeling data from publicly available residential information and images, addressing data accessibility and privacy issues.
Details
Motivation: Computational energy models require extensive data that is often inaccessible, expensive, or raises privacy concerns, creating barriers to research.
Method: Developed a modular multimodal framework that leverages generative AI to produce labeled energy modeling data from publicly accessible residential information and images.
Result: The framework successfully produces realistic labeled data while avoiding common issues with generative models, reducing dependence on costly or restricted data sources.
Conclusion: This approach paves the way for more accessible and reproducible energy modeling research by providing an alternative to traditional data collection methods.
Abstract: Computational models have emerged as powerful tools for energy modeling research, touting scalability and quantitative results. However, these models require a plethora of data, some of which is inaccessible, expensive, or raises privacy concerns. We introduce a modular multimodal framework to produce this data from publicly accessible residential information and images using generative artificial intelligence (AI). Additionally, we provide a pipeline demonstrating this framework, and we evaluate its generative AI components. Our experiments show that our framework’s use of AI avoids common issues with generative models. Our framework produces realistic, labeled data. By reducing dependence on costly or restricted data sources, we pave a path towards more accessible and reproducible research.
[179] Towards a Common Framework for Autoformalization
Agnieszka Mensfelt, David Tena Cucala, Santiago Franco, Angeliki Koutsoukou-Argyraki, Vince Trencsenyi, Kostas Stathis
Main category: cs.AI
TL;DR: This paper reviews autoformalization - translating informal input into formal representations using LLMs - across mathematics and other domains, proposing a unified framework to connect disparate research areas.
Details
Motivation: The rapid but fragmented development of autoformalization research across different fields has limited opportunities for shared methodologies, benchmarks, and theoretical frameworks that could accelerate progress.
Method: The paper conducts a comprehensive review of both explicit and implicit instances of autoformalization research, analyzing approaches from mathematics formalization to broader logical representation tasks.
Result: The review identifies common patterns and challenges in autoformalization research across different domains, revealing the need for better integration between these related but isolated research streams.
Conclusion: A unified framework for autoformalization is proposed to encourage cross-pollination between different fields and accelerate the development of next-generation AI systems capable of formal reasoning and representation.
Abstract: Autoformalization has emerged as a term referring to the automation of formalization - specifically, the formalization of mathematics using interactive theorem provers (proof assistants). Its rapid development has been driven by progress in deep learning, especially large language models (LLMs). More recently, the term has expanded beyond mathematics to describe the broader task of translating informal input into formal logical representations. At the same time, a growing body of research explores using LLMs to translate informal language into formal representations for reasoning, planning, and knowledge representation - often without explicitly referring to this process as autoformalization. As a result, despite addressing similar tasks, the largely independent development of these research areas has limited opportunities for shared methodologies, benchmarks, and theoretical frameworks that could accelerate progress. The goal of this paper is to review - explicit or implicit - instances of what can be considered autoformalization and to propose a unified framework, encouraging cross-pollination between different fields to advance the development of next generation AI systems.
[180] Multi-Turn Human-LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol
Harshvardhan Mestha, Karan Bania, Shreyas V, Sidong Liu, Ashwin Srinivasan
Main category: cs.AI
TL;DR: A structured protocol for human-LLM interaction using finite-state machines to achieve two-way intelligibility, tested in radiology and drug design domains.
Details
Motivation: To design software systems where human experts can effectively collaborate with LLMs on complex data analysis tasks through natural language, harnessing human expertise and creativity to find elusive solutions.
Method: Implemented an abstract protocol based on communicating finite-state machines to mediate structured interactions between humans and LLMs. Conducted controlled experiments with human proxies (databases) and uncontrolled experiments with human subjects in radiology and drug design domains.
Result: Empirical evidence supports the protocol’s capability to capture one- and two-way intelligibility in human-LLM interactions, demonstrating the utility of two-way intelligibility in human-machine system design.
Conclusion: The structured protocol enables effective human-LLM collaboration through two-way intelligibility, providing a framework for designing interactive systems that leverage both human expertise and LLM capabilities for complex problem-solving.
Abstract: Our interest is in the design of software systems involving a human-expert interacting – using natural language – with a large language model (LLM) on data analysis tasks. For complex problems, it is possible that LLMs can harness human expertise and creativity to find solutions that were otherwise elusive. On one level, this interaction takes place through multiple turns of prompts from the human and responses from the LLM. Here we investigate a more structured approach based on an abstract protocol described in [3] for interaction between agents. The protocol is motivated by a notion of “two-way intelligibility” and is modelled by a pair of communicating finite-state machines. We provide an implementation of the protocol, and provide empirical evidence of using the implementation to mediate interactions between an LLM and a human-agent in two areas of scientific interest (radiology and drug design). We conduct controlled experiments with a human proxy (a database), and uncontrolled experiments with human subjects. The results provide evidence in support of the protocol’s capability of capturing one- and two-way intelligibility in human-LLM interaction; and for the utility of two-way intelligibility in the design of human-machine systems.
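To make the protocol mechanics concrete, below is a minimal sketch of a pair of communicating finite-state machines mediating a human-LLM exchange. The message tags, transition table, and states are illustrative assumptions; the actual protocol is specified in the paper's reference [3] and is not reproduced here.

```python
from dataclasses import dataclass

# Message tags an agent can attach to a reply (illustrative assumptions,
# not the tag set from the paper's reference [3]).
TAGS = {"AGREE", "DISAGREE", "REVISE", "TERMINATE"}

@dataclass
class Message:
    sender: str   # "human" or "llm"
    tag: str      # one of TAGS
    content: str  # natural-language payload

class AgentFSM:
    """Minimal finite-state machine: the state tracks whether the exchange
    has become intelligible to this agent (one-way intelligibility)."""
    def __init__(self, name: str):
        self.name = name
        self.state = "START"

    def step(self, incoming: Message) -> str:
        assert incoming.tag in TAGS
        transitions = {
            ("START", "AGREE"): "INTELLIGIBLE",
            ("START", "DISAGREE"): "NEGOTIATING",
            ("NEGOTIATING", "REVISE"): "NEGOTIATING",
            ("NEGOTIATING", "AGREE"): "INTELLIGIBLE",
        }
        self.state = transitions.get((self.state, incoming.tag), self.state)
        return self.state

# Two-way intelligibility holds when BOTH machines reach INTELLIGIBLE.
human, llm = AgentFSM("human"), AgentFSM("llm")
script = [Message("human", "DISAGREE", "That finding looks wrong."),
          Message("llm", "REVISE", "Revised: the lesion is in the left lobe."),
          Message("human", "AGREE", "Yes, that matches the scan.")]
for msg in script:
    receiver = llm if msg.sender == "human" else human
    print(msg.tag, "->", receiver.name, "is now", receiver.step(msg))
```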
[181] Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation
Nana Han, Dong Liu, Tomas Norton
Main category: cs.AI
TL;DR: An intelligent knowledge assistant system using RAG with structured knowledge processing methods (table textualization and decision-tree textualization) for goat farming health management, achieving over 85% accuracy across various Q&A tasks.
Details
Motivation: LLMs have limited application in livestock farming due to constraints in knowledge availability, diversity, and complexity. This study aims to support health management in farmed goats by enhancing LLMs' understanding of heterogeneous data formats.
Method: Leveraged Retrieval-Augmented Generation (RAG) with two structured knowledge processing methods: table textualization and decision-tree textualization. Established a domain-specific goat farming knowledge base spanning five key domains and integrated an online search module for real-time information retrieval.
Result: Heterogeneous knowledge fusion method achieved mean accuracies of 87.90% on validation set and 84.22% on test set. Accuracy consistently exceeded 85% across text-based, table-based, and decision-tree based Q&A tasks. Omission was identified as the predominant error category.
Conclusion: The system demonstrates robustness and reliability for practical applications in goat farming, with opportunities to further improve retrieval coverage and context integration.
Abstract: Large language models (LLMs) are increasingly being recognised as valuable knowledge communication tools in many industries. However, their application in livestock farming remains limited, being constrained by several factors not least the availability, diversity and complexity of knowledge sources. This study introduces an intelligent knowledge assistant system designed to support health management in farmed goats. Leveraging the Retrieval-Augmented Generation (RAG), two structured knowledge processing methods, table textualization and decision-tree textualization, were proposed to enhance large language models’ (LLMs) understanding of heterogeneous data formats. Based on these methods, a domain-specific goat farming knowledge base was established to improve LLM’s capacity for cross-scenario generalization. The knowledge base spans five key domains: Disease Prevention and Treatment, Nutrition Management, Rearing Management, Goat Milk Management, and Basic Farming Knowledge. Additionally, an online search module is integrated to enable real-time retrieval of up-to-date information. To evaluate system performance, six ablation experiments were conducted to examine the contribution of each component. The results demonstrated that heterogeneous knowledge fusion method achieved the best results, with mean accuracies of 87.90% on the validation set and 84.22% on the test set. Across the text-based, table-based, decision-tree based Q&A tasks, accuracy consistently exceeded 85%, validating the effectiveness of structured knowledge fusion within a modular design. Error analysis identified omission as the predominant error category, highlighting opportunities to further improve retrieval coverage and context integration. In conclusion, the results highlight the robustness and reliability of the proposed system for practical applications in goat farming.
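As an illustration of the two structured knowledge processing methods, the sketch below shows table textualization and decision-tree textualization as simple templating steps. The field names, templates, and example rows are hypothetical, not taken from the paper's actual knowledge base.

```python
# Hypothetical rows from a goat nutrition table; field names and templates
# are illustrative, not the paper's actual knowledge base.
table = [
    {"stage": "lactating doe", "dry_matter_kg": 2.8, "protein_pct": 16},
    {"stage": "growing kid", "dry_matter_kg": 1.1, "protein_pct": 18},
]

def textualize_row(row: dict) -> str:
    # Table textualization: one table row -> one retrievable sentence.
    return (f"A {row['stage']} requires about {row['dry_matter_kg']} kg of "
            f"dry matter per day with {row['protein_pct']}% crude protein.")

# Decision-tree textualization: each root-to-leaf path -> one IF-THEN rule.
tree_paths = [(["fever above 40C", "reduced appetite"], "suspect mastitis")]
rules = [f"If {' and '.join(conds)}, then {leaf}." for conds, leaf in tree_paths]

# Both kinds of sentences become chunks in the RAG knowledge base.
for chunk in [textualize_row(r) for r in table] + rules:
    print(chunk)
```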
[182] LLMs as Agentic Cooperative Players in Multiplayer UNO
Yago Romano Matinez, Jesse Roberts
Main category: cs.AI
TL;DR: LLMs were tested as UNO game assistants tasked with helping another player win rather than winning themselves. While the models could play competently, most struggled to effectively assist another player.
Details
Motivation: To test whether LLMs can serve as active participants that provide genuine assistance to humans in accomplishing goals, using UNO as a testbed for collaborative gameplay.
Method: Built a tool for decoder-only LLMs to participate in RLCard game environment with full game-state information. Tested models from 1B to 70B parameters using two prompting strategies to assist another player rather than win themselves.
Result: All models outperformed random baseline when playing UNO normally, but few models were able to significantly help another player win the game.
Conclusion: LLMs can play games competently but struggle to provide effective assistance to human players, suggesting limitations in their ability to serve as active collaborative partners.
Abstract: LLMs promise to assist humans – not just by answering questions, but by offering useful guidance across a wide range of tasks. But how far does that assistance go? Can a large language model based agent actually help someone accomplish their goal as an active participant? We test this question by engaging an LLM in UNO, a turn-based card game, asking it not to win but instead help another player to do so. We built a tool that allows decoder-only LLMs to participate as agents within the RLCard game environment. These models receive full game-state information and respond using simple text prompts under two distinct prompting strategies. We evaluate models ranging from small (1B parameters) to large (70B parameters) and explore how model scale impacts performance. We find that while all models were able to successfully outperform a random baseline when playing UNO, few were able to significantly aid another player.
[183] The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science
Woong Shin, Renan Souza, Daniel Rosendo, Frédéric Suter, Feiyi Wang, Prasanna Balaprakash, Rafael Ferreira da Silva
Main category: cs.AI
TL;DR: A framework for evolving scientific workflows from static to intelligent and from single to swarm systems, proposing an architectural blueprint for autonomous scientific laboratories that could accelerate discovery by 100x.
Details
Motivation: Modern scientific discovery requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than focusing on science. AI agents show potential to accelerate discovery but need clear integration pathways.
Method: Proposes a conceptual framework with two evolutionary dimensions: intelligence (static to intelligent) and composition (single to swarm), along with an architectural blueprint for transitioning from current workflow systems to autonomous distributed laboratories.
Result: The paper presents a framework that charts an evolutionary path toward fully autonomous, distributed scientific laboratories capable of transformational scientific workflows.
Conclusion: The proposed architectural blueprint can help the community harness opportunities in autonomous science with potential for 100x discovery acceleration and transformative workflow capabilities.
Abstract: Modern scientific discovery increasingly requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than scientists. Advances in AI leading to AI agents show exciting new opportunities that can accelerate scientific discovery by providing intelligence as a component in the ecosystem. However, it is unclear how this new capability would materialize and integrate in the real world. To address this, we propose a conceptual framework where workflows evolve along two dimensions which are intelligence (from static to intelligent) and composition (from single to swarm) to chart an evolutionary path from current workflow management systems to fully autonomous, distributed scientific laboratories. With these trajectories in mind, we present an architectural blueprint that can help the community take the next steps towards harnessing the opportunities in autonomous science with the potential for 100x discovery acceleration and transformational scientific workflows.
[184] A Markovian Framing of WaveFunctionCollapse for Procedurally Generating Aesthetically Complex Environments
Franklin Yiu, Mohan Lu, Nina Li, Kevin Joseph, Tianxu Zhang, Julian Togelius, Timothy Merino, Sam Earle
Main category: cs.AI
TL;DR: Reformulating WaveFunctionCollapse as a Markov Decision Process to separate constraint satisfaction from objective optimization, outperforming traditional joint optimization approaches.
Details
Motivation: Address the challenge of simultaneously satisfying designer objectives and tile adjacency constraints in procedural content generation, where traditional methods struggle with complexity.
Method: Reformulate WaveFunctionCollapse as a Markov Decision Process (WFC-MDP), allowing external optimization algorithms to focus on objective maximization while WFC handles constraint satisfaction through its propagation mechanism.
Result: Across multiple domains with varying difficulties, optimization over WFC-MDP consistently outperforms traditional evolutionary approaches that jointly optimize global metrics and local tile placement, especially as task complexity increases.
Conclusion: Decoupling local constraint satisfaction from global objective optimization provides significant advantages over joint optimization approaches in procedural content generation.
Abstract: Procedural content generation often requires satisfying both designer-specified objectives and adjacency constraints implicitly imposed by the underlying tile set. To address the challenges of jointly optimizing both constraints and objectives, we reformulate WaveFunctionCollapse (WFC) as a Markov Decision Process (MDP), enabling external optimization algorithms to focus exclusively on objective maximization while leveraging WFC’s propagation mechanism to enforce constraint satisfaction. We empirically compare optimizing this MDP to traditional evolutionary approaches that jointly optimize global metrics and local tile placement. Across multiple domains with various difficulties, we find that joint optimization not only struggles as task complexity increases, but consistently underperforms relative to optimization over the WFC-MDP, underscoring the advantages of decoupling local constraint satisfaction from global objective optimization.
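A minimal sketch of the WFC-MDP framing on a toy one-dimensional grid with two tiles: the action commits a tile to a cell, WFC-style propagation prunes incompatible options to enforce adjacency constraints, and the external optimizer only sees the resulting reward. The tile set, adjacency rule, and objective here are invented for illustration.

```python
# Toy adjacency rule: "B" may not sit next to "B".
ADJACENT_OK = {("A", "A"), ("A", "B"), ("B", "A")}

def propagate(domains):
    """WFC-style constraint propagation: prune tiles that have no
    compatible neighbor until a fixed point is reached."""
    changed = True
    while changed:
        changed = False
        for i, dom in enumerate(domains):
            for t in list(dom):
                left_ok = i == 0 or any((l, t) in ADJACENT_OK for l in domains[i - 1])
                right_ok = i == len(domains) - 1 or any(
                    (t, r) in ADJACENT_OK for r in domains[i + 1])
                if not (left_ok and right_ok):
                    dom.discard(t)
                    changed = True
    return domains

def mdp_step(domains, cell, tile, objective):
    """Action = commit one tile to one cell; propagation then enforces the
    adjacency constraints, so the external optimizer only sees the reward."""
    domains[cell] = {tile}
    domains = propagate(domains)
    state = [next(iter(d)) if len(d) == 1 else "?" for d in domains]
    return domains, objective(state)

# Toy objective for the external optimizer: number of "B" tiles placed.
domains = [{"A", "B"} for _ in range(4)]
domains, reward = mdp_step(domains, cell=1, tile="B",
                           objective=lambda s: s.count("B"))
print(reward, [sorted(d) for d in domains])
```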
[185] Evaluation of Black-Box XAI Approaches for Predictors of Values of Boolean Formulae
Stav Armoni-Friedmann, Hana Chockler, David A. Kelly
Main category: cs.AI
TL;DR: Proposes a formal measure for evaluating XAI tools on tabular Boolean-function data and introduces the B-ReX tool, which outperforms other black-box XAI tools on a large-scale benchmark.
Details
Motivation: Addressing the challenge of subjective evaluation in explainable AI, particularly for tabular data and Boolean function predictions.
Method: Extends previous work with a formal importance measure based on actual causality, evaluates state-of-the-art XAI tools against it, and develops the novel B-ReX tool based on ReX.
Result: B-ReX achieves superior performance with Jensen-Shannon divergence of 0.072 ± 0.012 on random 10-valued Boolean formulae benchmark.
Conclusion: B-ReX demonstrates better performance than other black-box XAI tools, providing a more objective evaluation framework for XAI approaches in tabular Boolean function contexts.
Abstract: Evaluating explainable AI (XAI) approaches is a challenging task in general, due to the subjectivity of explanations. In this paper, we focus on tabular data and the specific use case of AI models predicting the values of Boolean functions. We extend the previous work in this domain by proposing a formal and precise measure of importance of variables based on actual causality, and we evaluate state-of-the-art XAI tools against this measure. We also present a novel XAI tool B-ReX, based on the existing tool ReX, and demonstrate that it is superior to other black-box XAI tools on a large-scale benchmark. Specifically, B-ReX achieves a Jensen-Shannon divergence of 0.072 $\pm$ 0.012 on random 10-valued Boolean formulae.
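For reference, Jensen-Shannon divergence (the evaluation metric reported above) compares a tool's variable-importance distribution against a ground-truth importance distribution; lower is better. The sketch below shows the computation on made-up scores for a four-variable formula.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence in bits (terms with p_i = 0 contribute 0).
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Symmetrized, smoothed KL against the midpoint distribution.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical importance scores over 4 variables of a Boolean formula,
# normalized to distributions: causal ground truth vs. an XAI tool's output.
truth = [0.4, 0.3, 0.2, 0.1]
tool = [0.35, 0.35, 0.2, 0.1]
print(round(js_divergence(truth, tool), 4))  # lower = closer to causal truth
```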
[186] GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Method
Hailong Yang, Renhuo Zhao, Guanjin Wang, Zhaohong Deng
Main category: cs.AI
TL;DR: GAMA is a privacy-preserving multi-agent system that separates agents into private and public spaces, using anonymization and enhancement modules to protect sensitive data while maintaining performance.
Details
Motivation: LLM-based multi-agent systems need to handle privacy-sensitive data securely when using remote LLM services, requiring privacy-preserving mechanisms.
Method: Divides workspace into private/public spaces, uses anonymization, and incorporates DRKE (Domain-Rule-based Knowledge Enhancement) and DLE (Disproof-based Logic Enhancement) modules to mitigate semantic loss.
Result: Superior performance on Trivia Creative Writing and Logic Grid Puzzle datasets, and exceptional effectiveness on new privacy-focused datasets for both task processing and privacy preservation.
Conclusion: GAMA successfully addresses privacy concerns in LLM-based multi-agent systems while maintaining high performance through its innovative architecture and enhancement modules.
Abstract: With the rapid advancement of Large Language Model (LLM), LLM-based agents exhibit exceptional abilities in understanding and generating natural language, facilitating human-like collaboration and information transmission in LLM-based Multi-Agent System (MAS). High-performance LLMs are often hosted on remote servers in public spaces. When tasks involve privacy data, MAS cannot securely utilize these LLMs without implementing privacy-preserving mechanisms. To address this challenge, we propose a General Anonymizing Multi-Agent system (GAMA), which divides the agents’ workspace into private and public spaces and protects privacy through the anonymizing mechanism. In the private space, agents handle sensitive data, while in the public space, only anonymized data is utilized. GAMA incorporates two key modules to mitigate semantic loss caused by anonymization: Domain-Rule-based Knowledge Enhancement (DRKE) and Disproof-based Logic Enhancement (DLE). We evaluate GAMA on two public question-answering datasets: Trivia Creative Writing and Logic Grid Puzzle. The results demonstrate that GAMA has superior performance compared to the state-of-the-art models. To further assess its privacy-preserving capabilities, we designed two new datasets: Knowledge Privacy Preservation and Logic Privacy Preservation. The final results highlight GAMA’s exceptional effectiveness in both task processing and privacy preservation.
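The private/public split can be illustrated with a round-trip sketch: sensitive entities are replaced by placeholders before a query leaves the private space, and the stored mapping restores them in the remote model's response. Entity detection is hard-coded here, and GAMA's DRKE/DLE enhancement modules are not reproduced; this is only the anonymization mechanics.

```python
def anonymize(text, entities):
    """Replace sensitive spans with placeholders; keep the mapping private."""
    mapping = {}
    for i, ent in enumerate(entities):
        placeholder = f"[ENT_{i}]"
        mapping[placeholder] = ent
        text = text.replace(ent, placeholder)
    return text, mapping

def deanonymize(text, mapping):
    """Restore entities in the remote LLM's answer inside the private space."""
    for placeholder, ent in mapping.items():
        text = text.replace(placeholder, ent)
    return text

query = "Does Alice Chen's diagnosis of lupus affect her insurance?"
public_query, mapping = anonymize(query, ["Alice Chen", "lupus"])
print(public_query)  # safe to send to a remote LLM in the public space
remote_answer = "Conditions like [ENT_1] can affect coverage for [ENT_0]."
print(deanonymize(remote_answer, mapping))
```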
[187] XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph
Hailong Yang, Mingxian Gu, Jianqi Wang, Guanjin Wang, Zhaohong Deng
Main category: cs.AI
TL;DR: XAgents is a multi-agent framework that uses multipolar task graphs and IF-THEN rules to improve task planning and handle uncertainty in complex tasks, outperforming state-of-the-art approaches.
Details
Motivation: Multi-agent systems struggle with effective task planning for highly complex tasks with uncertainty, often producing misleading outputs that hinder execution.
Method: A unified multi-agent cooperative framework built on multipolar task processing graphs and IF-THEN rules for dynamic task planning and behavior constraints.
Result: XAgents consistently surpasses state-of-the-art single-agent and multi-agent approaches across three distinct datasets in knowledge-typed and logic-typed question-answering tasks.
Conclusion: The proposed framework effectively addresses task uncertainty and improves collaborative task execution through structured planning and rule-based constraints.
Abstract: The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of Multi-Agent Systems (MAS) in supporting humans with complex, real-world tasks. However, MAS still face challenges in effective task planning when handling highly complex tasks with uncertainty, often resulting in misleading or incorrect outputs that hinder task execution. To address this, we propose XAgents, a unified multi-agent cooperative framework built on a multipolar task processing graph and IF-THEN rules. XAgents uses the multipolar task processing graph to enable dynamic task planning and handle task uncertainty. During subtask processing, it integrates domain-specific IF-THEN rules to constrain agent behaviors, while global rules enhance inter-agent collaboration. We evaluate the performance of XAgents across three distinct datasets, demonstrating that it consistently surpasses state-of-the-art single-agent and multi-agent approaches in both knowledge-typed and logic-typed question-answering tasks. The codes for XAgents are available at: https://github.com/AGI-FHBC/XAgents.
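A minimal sketch of domain-specific IF-THEN rules constraining agent behavior during subtask processing; the rule contents and the context schema are illustrative assumptions, not the paper's actual rule language.

```python
# Each rule: (condition over the subtask context, constraint on the agent).
rules = [
    (lambda ctx: ctx["domain"] == "medical", "cite a source for every claim"),
    (lambda ctx: ctx["uncertain"], "route the subtask to a second agent for review"),
]

def constraints_for(ctx):
    """Collect the constraints whose IF-conditions fire for this subtask."""
    return [action for cond, action in rules if cond(ctx)]

print(constraints_for({"domain": "medical", "uncertain": True}))
```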
[188] AI Harmonics: a human-centric and harms severity-adaptive AI risk assessment framework
Sofia Vei, Paolo Giudici, Pavlos Sermpezis, Athena Vakali, Adelaide Emma Bernardelli
Main category: cs.AI
TL;DR: AI Harmonics is a human-centric AI risk assessment framework that uses empirical incident data and a novel ordinal severity metric to identify and prioritize AI harms, with political and physical harms showing the highest urgency for mitigation.
Details
Motivation: Current AI risk assessment models focus on internal compliance and neglect diverse stakeholder perspectives and real-world consequences, failing to address the unprecedented societal harms introduced by AI dominance.
Method: Proposes AI Harmonics framework with a novel AI harm assessment metric (AIH) that leverages ordinal severity data to capture relative impact without precise numerical estimates. Uses empirical incident data and combines generalized methodology with stakeholder-aware framework.
Result: Experiments on annotated incident data show political and physical harms have the highest concentration - political harms erode public trust while physical harms pose life-threatening risks. Framework consistently identifies uneven harm distributions.
Conclusion: AI Harmonics enables policymakers and organizations to effectively target mitigation efforts by providing a data-driven, human-centric approach to prioritize AI harms based on real-world severity and stakeholder impact.
Abstract: The absolute dominance of Artificial Intelligence (AI) introduces unprecedented societal harms and risks. Existing AI risk assessment models focus on internal compliance, often neglecting diverse stakeholder perspectives and real-world consequences. We propose a paradigm shift to a human-centric, harm-severity adaptive approach grounded in empirical incident data. We present AI Harmonics, which includes a novel AI harm assessment metric (AIH) that leverages ordinal severity data to capture relative impact without requiring precise numerical estimates. AI Harmonics combines a robust, generalized methodology with a data-driven, stakeholder-aware framework for exploring and prioritizing AI harms. Experiments on annotated incident data confirm that political and physical harms exhibit the highest concentration and thus warrant urgent mitigation: political harms erode public trust, while physical harms pose serious, even life-threatening risks, underscoring the real-world relevance of our approach. Finally, we demonstrate that AI Harmonics consistently identifies uneven harm distributions, enabling policymakers and organizations to target their mitigation efforts effectively.
[189] Virtual Agent Economies
Nenad Tomasev, Matija Franklin, Joel Z. Leibo, Julian Jacobs, William A. Cunningham, Iason Gabriel, Simon Osindero
Main category: cs.AI
TL;DR: The paper proposes a “sandbox economy” framework to analyze emerging AI agent economies, categorizing them by origin (emergent vs intentional) and separateness (permeable vs impermeable), and suggests design approaches for safe, steerable AI markets.
Details
Motivation: The rapid adoption of autonomous AI agents is creating new economic systems that operate beyond human oversight, presenting both opportunities for unprecedented coordination and risks including systemic economic threats and inequality.
Method: The authors propose a framework analyzing AI agent economies along two dimensions: origins (emergent vs intentional) and separateness from human economy (permeable vs impermeable). They discuss design choices including auction mechanisms, AI “mission economies” for collective goals, and socio-technical infrastructure.
Result: The analysis suggests the current trajectory leads to the spontaneous emergence of a vast, permeable AI agent economy. The paper identifies opportunities for coordination but also significant challenges requiring proactive design interventions.
Conclusion: Proactive design of steerable AI agent markets is necessary to ensure the technological shift aligns with humanity’s long-term collective flourishing, through mechanisms like fair resource allocation, coordinated mission economies, and trust infrastructure.
Abstract: The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the “sandbox economy” as a framework for analyzing this emergent system, characterizing it along two key dimensions: its origins (emergent vs. intentional) and its degree of separateness from the established human economy (permeable vs. impermeable). Our current trajectory points toward a spontaneous emergence of a vast and highly permeable AI agent economy, presenting us with opportunities for an unprecedented degree of coordination as well as significant challenges, including systemic economic risk and exacerbated inequality. Here we discuss a number of possible design choices that may lead to safely steerable AI agent markets. In particular, we consider auction mechanisms for fair resource allocation and preference resolution, the design of AI “mission economies” to coordinate around achieving collective goals, and socio-technical infrastructure needed to ensure trust, safety, and accountability. By doing this, we argue for the proactive design of steerable agent markets to ensure the coming technological shift aligns with humanity’s long-term collective flourishing.
[190] Online Robust Planning under Model Uncertainty: A Sample-Based Approach
Tamir Shazman, Idan Lev-Yehudi, Ron Benchetit, Vadim Indelman
Main category: cs.AI
TL;DR: RSS is the first online planning algorithm for Robust MDPs with finite-sample theoretical guarantees, providing robust value estimation in uncertain environments while maintaining computational efficiency.
Details
Motivation: Existing online planning methods like Sparse Sampling and MCTS perform poorly when generative models have approximation errors from limited data, leading to degraded performance and unsafe behaviors. Robust MDPs address uncertainty but are computationally intensive for real-time use.
Method: Robust Sparse Sampling (RSS) computes robust value functions using Sample Average Approximation (SAA) instead of nominal value estimation. It works in infinite/continuous state spaces with sample and computational complexities independent of state space size.
Result: RSS provides theoretical performance guarantees and empirically outperforms standard Sparse Sampling in environments with uncertain dynamics.
Conclusion: RSS enables tractable robust policy computation for online planning under model uncertainty, combining the efficiency of sample-based methods with the robustness of RMDP frameworks.
Abstract: Online planning in Markov Decision Processes (MDPs) enables agents to make sequential decisions by simulating future trajectories from the current state, making it well-suited for large-scale or dynamic environments. Sample-based methods such as Sparse Sampling and Monte Carlo Tree Search (MCTS) are widely adopted for their ability to approximate optimal actions using a generative model. However, in practical settings, the generative model is often learned from limited data, introducing approximation errors that can degrade performance or lead to unsafe behaviors. To address these challenges, Robust MDPs (RMDPs) offer a principled framework for planning under model uncertainty, yet existing approaches are typically computationally intensive and not suited for real-time use. In this work, we introduce Robust Sparse Sampling (RSS), the first online planning algorithm for RMDPs with finite-sample theoretical performance guarantees. Unlike Sparse Sampling, which estimates the nominal value function, RSS computes a robust value function by leveraging the efficiency and theoretical properties of Sample Average Approximation (SAA), enabling tractable robust policy computation in online settings. RSS is applicable to infinite or continuous state spaces, and its sample and computational complexities are independent of the state space size. We provide theoretical performance guarantees and empirically show that RSS outperforms standard Sparse Sampling in environments with uncertain dynamics.
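A sketch of the SAA-style robust backup under one common RMDP ambiguity set, an L1 ball around the empirical next-state distribution (the choice of set is an assumption here; the paper's exact formulation is not given in the summary): instead of averaging sampled next-state values as Sparse Sampling does, probability mass is shifted toward the worst outcomes before averaging.

```python
def robust_value_saa(sample_values, radius):
    """Worst-case mean over distributions within an L1 ball of radius
    `radius` around the empirical distribution of sampled next-state
    values: up to radius/2 probability mass is moved from the best
    outcomes onto the worst one."""
    n = len(sample_values)
    vals = sorted(sample_values)  # ascending; vals[0] is the worst outcome
    w = [1.0 / n] * n             # empirical (SAA) weights
    budget = radius / 2.0
    i = n - 1
    while budget > 1e-12 and i > 0:
        moved = min(budget, w[i])
        w[i] -= moved             # take mass from the best outcomes...
        w[0] += moved             # ...and pile it onto the worst one
        budget -= moved
        i -= 1
    return sum(wi * vi for wi, vi in zip(w, vals))

samples = [1.0, 0.9, 0.2, 1.0, 0.8]           # values sampled from the learned model
print(sum(samples) / len(samples))            # nominal Sparse Sampling estimate
print(robust_value_saa(samples, radius=0.4))  # pessimistic robust estimate
```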
[191] Compartmentalised Agentic Reasoning for Clinical NLI
Maël Jullien, Lei Xu, Marco Valentino, André Freitas
Main category: cs.AI
TL;DR: CARENLI framework improves clinical NLI by separating knowledge access from structured inference through family-specific solvers and auditable procedures, achieving fidelity gains of up to 42 points.
Details
Motivation: To address the assumption that scaling alone improves structured reasoning in clinical NLI, and to create safer, auditable reasoning by making inference processes explicit rather than relying on heuristics.
Method: CARENLI compartmentalizes reasoning into four families (Causal Attribution, Compositional Grounding, Epistemic Verification, Risk State Abstraction) with specialized solvers, and uses planner, verifier, and refiner components for auditable procedures.
Result: Achieved 98.0% fidelity in Causal Attribution and 81.2% in Risk State Abstraction across four LLMs, with verifiers showing near-ceiling reliability and refiners correcting substantial epistemic errors. Main bottleneck identified as family classification.
Conclusion: LLMs retain relevant facts but default to heuristics when inference is underspecified; CARENLI provides a framework for explicit, safer reasoning while exposing routing as the primary challenge.
Abstract: A common assumption holds that scaling data and parameters yields increasingly structured, generalisable internal representations. We interrogate this assumption in clinical natural language inference (NLI) by adopting a benchmark decomposed into four reasoning families, Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction, and introducing CARENLI, a Compartmentalised Agentic Reasoning for Clinical NLI that separates knowledge access from principled inference. CARENLI routes each premise, statement pair to a family specific solver and enforces auditable procedures via a planner, verifier, and refiner. Across four LLMs, CARENLI improves fidelity by up to 42 points, reaching 98.0% in Causal Attribution and 81.2% in Risk State Abstraction. Verifiers flag violations with near-ceiling reliability, while refiners correct a substantial share of epistemic errors. Remaining failures cluster in routing, identifying family classification as the main bottleneck. These results show that LLMs often retain relevant facts but default to heuristics when inference is underspecified, a dissociation CARENLI makes explicit while offering a framework for safer, auditable reasoning.
[192] Investigating Language Model Capabilities to Represent and Process Formal Knowledge: A Preliminary Study to Assist Ontology Engineering
Hanna Abi Akl
Main category: cs.AI
TL;DR: Using formal methods with Small Language Models improves reasoning performance, enabling substitution of natural language with compact logical languages while maintaining strong reasoning capabilities for ontology engineering tasks.
Details
Motivation: Recent Language Models have limitations in reasoning tasks, particularly impacting ontology engineering. The research aims to address these shortcomings by incorporating formal methods to enhance Small Language Models' reasoning capabilities.
Method: Conducted preliminary experiments to test the impact of expressing logical problems with different grammars on SLM performance. Specifically compared natural language with more compact logical languages in predefined reasoning tasks.
Result: Findings show it’s possible to substitute Natural Language with compact logical languages while maintaining strong performance on reasoning tasks.
Conclusion: The results provide a foundation for refining the role of Small Language Models in ontology engineering and demonstrate the viability of using formal methods to enhance reasoning capabilities.
Abstract: Recent advances in Language Models (LMs) have failed to mask their shortcomings particularly in the domain of reasoning. This limitation impacts several tasks, most notably those involving ontology engineering. As part of a PhD research, we investigate the consequences of incorporating formal methods on the performance of Small Language Models (SLMs) on reasoning tasks. Specifically, we aim to orient our work toward using SLMs to bootstrap ontology construction and set up a series of preliminary experiments to determine the impact of expressing logical problems with different grammars on the performance of SLMs on a predefined reasoning task. Our findings show that it is possible to substitute Natural Language (NL) with a more compact logical language while maintaining a strong performance on reasoning tasks and hope to use these results to further refine the role of SLMs in ontology engineering.
[193] The Morality of Probability: How Implicit Moral Biases in LLMs May Shape the Future of Human-AI Symbiosis
Eoin O’Doherty, Nicole Weinrauch, Andrew Talone, Uri Klempner, Xiaoyuan Yi, Xing Xie, Yi Zeng
Main category: cs.AI
TL;DR: Study examines moral value biases in large language models, finding consistent preference for Care/Virtue frameworks and penalization of libertarian choices across 6 LLMs, with reasoning models showing better context sensitivity.
Details
Motivation: To understand how AI systems prioritize moral values and assess prospects for human-AI symbiosis by examining implicit moral preferences in state-of-the-art LLMs.
Method: Quantitative experiment with six LLMs, ranking and scoring outcomes across 18 dilemmas representing five moral frameworks, analyzing effects of model architecture, cultural origin, and explainability.
Result: Strikingly consistent value biases - Care and Virtue values rated most moral across all models, libertarian choices consistently penalized. Reasoning models showed greater context sensitivity and richer explanations, while non-reasoning models produced uniform but opaque judgments.
Conclusion: Research contributes empirical comparison of moral reasoning across culturally distinct LLMs, links probabilistic behavior with value encodings, and highlights need for explainability and cultural awareness as critical design principles for transparent and aligned AI.
Abstract: Artificial intelligence (AI) is advancing at a pace that raises urgent questions about how to align machine decision-making with human moral values. This working paper investigates how leading AI systems prioritize moral outcomes and what this reveals about the prospects for human-AI symbiosis. We address two central questions: (1) What moral values do state-of-the-art large language models (LLMs) implicitly favour when confronted with dilemmas? (2) How do differences in model architecture, cultural origin, and explainability affect these moral preferences? To explore these questions, we conduct a quantitative experiment with six LLMs, ranking and scoring outcomes across 18 dilemmas representing five moral frameworks. Our findings uncover strikingly consistent value biases. Across all models, Care and Virtue values outcomes were rated most moral, while libertarian choices were consistently penalized. Reasoning-enabled models exhibited greater sensitivity to context and provided richer explanations, whereas non-reasoning models produced more uniform but opaque judgments. This research makes three contributions: (i) Empirically, it delivers a large-scale comparison of moral reasoning across culturally distinct LLMs; (ii) Theoretically, it links probabilistic model behaviour with underlying value encodings; (iii) Practically, it highlights the need for explainability and cultural awareness as critical design principles to guide AI toward a transparent, aligned, and symbiotic future.
[194] State Algebra for Propositional Logic
Dmitry Lesnik, Tobias Schäfer
Main category: cs.AI
TL;DR: State Algebra is a novel algebraic framework for representing and manipulating propositional logic through three hierarchical representations (Set, Coordinate, Row Decomposition), offering flexibility in representation while enabling efficient computation through algebraic methods.
Details
Motivation: To create a flexible framework for propositional logic that bridges well-known semantics with powerful algebraic computation, allowing for more compact problem representations and supporting both search-based and knowledge compilation approaches.
Method: Develops a hierarchy of three representations: Set, Coordinate, and Row Decomposition. Uses algebraic methods for computation and shows how canonical forms can be achieved through fixed variable ordering during reduction processes.
Result: The framework demonstrates that while default state vector reduction is not canonical, canonical forms can be obtained with fixed variable ordering. This trade-off provides increased flexibility and potentially more compact representations for certain problem classes.
Conclusion: State Algebra provides a versatile algebraic foundation for propositional logic manipulation that supports various algorithms and naturally extends to probabilistic logic and Weighted Model Counting applications.
Abstract: This paper presents State Algebra, a novel framework designed to represent and manipulate propositional logic using algebraic methods. The framework is structured as a hierarchy of three representations: Set, Coordinate, and Row Decomposition. These representations anchor the system in well-known semantics while facilitating the computation using a powerful algebraic engine. A key aspect of State Algebra is its flexibility in representation. We show that although the default reduction of a state vector is not canonical, a unique canonical form can be obtained by applying a fixed variable order during the reduction process. This highlights a trade-off: by foregoing guaranteed canonicity, the framework gains increased flexibility, potentially leading to more compact representations of certain classes of problems. We explore how this framework provides tools to articulate both search-based and knowledge compilation algorithms and discuss its natural extension to probabilistic logic and Weighted Model Counting.
[195] Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Yue Zhang
Main category: cs.AI
TL;DR: A2P Scaffolding is a novel agent framework that transforms failure attribution from pattern recognition to structured causal inference, achieving 2.85x improvement in step-level accuracy over baselines.
Details
Motivation: Current methods for failure attribution in multi-agent systems have critically low step-level accuracy (below 17%) due to inability to perform robust counterfactual reasoning, making them impractical for debugging complex systems.
Method: Abduct-Act-Predict (A2P) Scaffolding - a three-step reasoning process: (1) Abduction to infer hidden root causes, (2) Action to define minimal corrective intervention, and (3) Prediction to simulate subsequent trajectory and verify if intervention resolves failure.
Result: On Algorithm-Generated dataset: 47.46% step-level accuracy (2.85x improvement over 16.67% baseline). On Hand-Crafted dataset: 29.31% step accuracy (2.43x improvement over 12.07% baseline).
Conclusion: By reframing failure attribution through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution in multi-agent systems.
Abstract: Failure attribution in multi-agent systems – pinpointing the exact step where a decisive error occurs – is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step-level accuracy (below 17%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this counterfactual inference gap, we introduce Abduct-Act-Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three-step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent’s actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model’s analysis. Our extensive experiments on the Who&When benchmark demonstrate its efficacy. On the Algorithm-Generated dataset, A2P achieves 47.46% step-level accuracy, a 2.85$\times$ improvement over the 16.67% of the baseline. On the more complex Hand-Crafted dataset, it achieves 29.31% step accuracy, a 2.43$\times$ improvement over the baseline’s 12.07%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution.
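Since A2P runs all three steps in a single inference pass, it can be rendered as one structured prompt. The wording below paraphrases the summary; the paper's actual prompt text is not reproduced, and the template fields are hypothetical.

```python
# Hypothetical single-pass rendering of the A2P scaffold as a prompt template.
A2P_TEMPLATE = """You are given a multi-agent conversation log that ended in failure.
Candidate step: {step}

1. ABDUCTION: infer the hidden root cause behind the agent's action at this step.
2. ACTION: define the minimal corrective intervention to that single action.
3. PREDICTION: simulate the trajectory after the intervention and state
   whether the task failure is averted.

Conclude with: DECISIVE_STEP = yes/no.

Log:
{log}"""

def build_a2p_prompt(log: str, step: int) -> str:
    return A2P_TEMPLATE.format(step=step, log=log)

print(build_a2p_prompt("agent_1: ... (truncated log)", step=7)[:200])
```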
[196] Mutual Information Tracks Policy Coherence in Reinforcement Learning
Cameron Reid, Wael Hafez, Amirhossein Nazeri
Main category: cs.AI
TL;DR: Information-theoretic framework for RL agents that uses mutual information patterns to both understand learning dynamics and diagnose deployment-time anomalies like sensor/actuator faults.
Details
Motivation: RL agents lack intrinsic mechanisms to detect and diagnose real-world failures like sensor faults, actuator wear, and environmental shifts during deployment.
Method: Analyze state-action mutual information patterns in robotic control tasks, examining how information metrics change during learning and under controlled perturbation experiments.
Result: Successful learning shows characteristic signatures: state-action mutual information increases 238% (0.84 to 2.83 bits), while joint mutual information follows an inverted U-curve. Information metrics can differentially diagnose faults - sensor faults cause broad information collapse, while actuator faults selectively disrupt action-outcome predictability.
Conclusion: Information patterns serve as both signatures of learning and diagnostics for system health, enabling adaptive RL systems with autonomous fault detection and policy adjustment based on information-theoretic principles.
Abstract: Reinforcement Learning (RL) agents deployed in real-world environments face degradation from sensor faults, actuator wear, and environmental shifts, yet lack intrinsic mechanisms to detect and diagnose these failures. We present an information-theoretic framework that reveals both the fundamental dynamics of RL and provides practical methods for diagnosing deployment-time anomalies. Through analysis of state-action mutual information patterns in a robotic control task, we first demonstrate that successful learning exhibits characteristic information signatures: mutual information between states and actions steadily increases from 0.84 to 2.83 bits (238% growth) despite growing state entropy, indicating that agents develop increasingly selective attention to task-relevant patterns. Intriguingly, states, actions and next states joint mutual information, MI(S,A;S’), follows an inverted U-curve, peaking during early learning before declining as the agent specializes suggesting a transition from broad exploration to efficient exploitation. More immediately actionable, we show that information metrics can differentially diagnose system failures: observation-space, i.e., states noise (sensor faults) produces broad collapses across all information channels with pronounced drops in state-action coupling, while action-space noise (actuator faults) selectively disrupts action-outcome predictability while preserving state-action relationships. This differential diagnostic capability demonstrated through controlled perturbation experiments enables precise fault localization without architectural modifications or performance degradation. By establishing information patterns as both signatures of learning and diagnostic for system health, we provide the foundation for adaptive RL systems capable of autonomous fault detection and policy adjustment based on information-theoretic principles.
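The core quantity is mutual information between states and actions. The sketch below computes a plug-in estimate of MI(S;A) in bits from observed pairs (the binning needed for continuous states is omitted, and the paper's exact estimator is not specified) and shows how a coherent policy yields high MI while sensor-style noise collapses it.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in MI(S;A) in bits from observed (state, action) pairs."""
    n = len(pairs)
    p_sa = Counter(pairs)
    p_s = Counter(s for s, _ in pairs)
    p_a = Counter(a for _, a in pairs)
    mi = 0.0
    for (s, a), c in p_sa.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((p_s[s] / n) * (p_a[a] / n)))
    return mi

# A coherent policy couples states to actions (high MI); a faulty sensor
# decouples them (MI collapses toward 0 bits).
coherent = [("s0", "a0"), ("s1", "a1")] * 50
noisy = [("s0", "a0"), ("s0", "a1"), ("s1", "a0"), ("s1", "a1")] * 25
print(mutual_information(coherent))  # 1.0 bit
print(mutual_information(noisy))     # 0.0 bits
```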
[197] Spatio-Temporal Graphical Counterfactuals: An Overview
Mingyu Kang, Duxin Chen, Ziyuan Pu, Jianxi Gao, Wenwu Yu
Main category: cs.AI
TL;DR: Survey comparing counterfactual thinking models (Potential Outcome Model and Structural Causal Model) and proposing unified graphical causal framework for spatio-temporal counterfactual inference.
Details
Motivation: Counterfactual thinking is crucial for AI to learn from data and improve performance in new scenarios, but existing models have different approaches and lack graphical methods for spatio-temporal counterfactuals considering spatial and temporal interactions.
Method: Conducts a comparative survey of different counterfactual models, theories, and approaches, then builds a unified graphical causal framework specifically designed for inferring spatio-temporal counterfactuals.
Result: The paper provides a comprehensive comparison of counterfactual thinking methodologies and develops a novel graphical framework that can handle spatial and temporal interactions between multiple units for counterfactual inference.
Conclusion: A unified graphical causal framework is established to address the gap in spatio-temporal counterfactual inference, enabling better counterfactual analysis that considers both spatial and temporal dimensions across multiple units.
Abstract: Counterfactual thinking is a critical yet challenging topic for artificial intelligence to learn knowledge from data and ultimately improve their performances for new scenarios. Many research works, including Potential Outcome Model and Structural Causal Model, have been proposed to realize it. However, their modelings, theoretical foundations and application approaches are usually different. Moreover, there is a lack of graphical approach to infer spatio-temporal counterfactuals, that considers spatial and temporal interactions between multiple units. Thus, in this work, our aim is to investigate a survey to compare and discuss different counterfactual models, theories and approaches, and further build a unified graphical causal frameworks to infer the spatio-temporal counterfactuals.
[198] Learning to Plan with Personalized Preferences
Manjie Xu, Xinyi Yang, Wei Liang, Chi Zhang, Yixin Zhu
Main category: cs.AI
TL;DR: The paper introduces Preference-based Planning (PbP) benchmark to address the limitation of current embodied AI agents that overlook personal preferences in planning, demonstrating that learning preferences from few demonstrations improves personalized plan generation.
Details
Motivation: Current embodied intelligence approaches adopt generalized methods that fail to account for individual human preferences in collaborative scenarios, limiting effective AI integration into daily life.
Method: Developed agents that learn preferences from few demonstrations and adapt planning strategies accordingly. Introduced PbP benchmark with hundreds of diverse preferences ranging from atomic actions to complex sequences. Evaluated state-of-the-art methods and incorporated learned preferences as intermediate representations in planning.
Result: Symbol-based approaches show promise in scalability but face challenges in generating and executing personalized preference-satisfying plans. Incorporating learned preferences as intermediate representations significantly improves agents’ ability to construct personalized plans.
Conclusion: Preferences serve as a valuable abstraction layer for adaptive planning, opening new research directions for preference-guided plan generation and execution in embodied AI systems.
Abstract: Effective integration of AI agents into daily life requires them to understand and adapt to individual human preferences, particularly in collaborative roles. Although recent studies on embodied intelligence have advanced significantly, they typically adopt generalized approaches that overlook personal preferences in planning. We address this limitation by developing agents that not only learn preferences from few demonstrations but also learn to adapt their planning strategies based on these preferences. Our research leverages the observation that preferences, though implicitly expressed through minimal demonstrations, can generalize across diverse planning scenarios. To systematically evaluate this hypothesis, we introduce Preference-based Planning (PbP) benchmark, an embodied benchmark featuring hundreds of diverse preferences spanning from atomic actions to complex sequences. Our evaluation of SOTA methods reveals that while symbol-based approaches show promise in scalability, significant challenges remain in learning to generate and execute plans that satisfy personalized preferences. We further demonstrate that incorporating learned preferences as intermediate representations in planning significantly improves the agent’s ability to construct personalized plans. These findings establish preferences as a valuable abstraction layer for adaptive planning, opening new directions for research in preference-guided plan generation and execution.
[199] QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads
Muhammad Ahmad, Khurram Mazher, Saqib Akram, Ahmad Tameem, Saad Bin Nasir
Main category: cs.AI
TL;DR: QuantX is a quantization suite that enables 3-bit quantization of LLMs/VLMs with minimal performance loss (<6% degradation), outperforms SOTA methods, and provides hardware-aware optimization for efficient inference.
Details
Motivation: To address the computational and memory constraints of deploying large language and vision models by developing efficient quantization techniques that maintain performance while reducing model size and inference costs.
Method: Develops a suite of quantization recipes that consider hardware-specific constraints for efficient dequantization, enabling flexible trade-offs between runtime speed, memory requirements, and model accuracy. Includes integration with Llama.cpp framework.
Result: Achieves performance within 6% of unquantized models for LlaVa-v1.6 at 3-bit quantization across multiple tasks, outperforms state-of-the-art quantization techniques, and demonstrates practical runtime feasibility.
Conclusion: QuantX provides effective quantization solutions down to 3-bit resolutions with minimal performance loss, offering valuable insights and practical recipes for efficient LLM/VLM deployment while maintaining competitive accuracy.
Abstract: We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take into account hardware-specific constraints to achieve efficient dequantization during inference ensuring flexible trade-off between runtime speed, memory requirement and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LlaVa-v1.6 quantized down to 3-bits for multiple end user tasks and outperforms recently published state-of-the-art quantization techniques. We further integrate one particular technique from QuantX into the popular Llama.cpp framework and show its feasibility in terms of runtime compared to the mainstream quantization techniques from Llama.cpp. Lastly, this manuscript provides insights into the LLM quantization process that motivated the range of recipes and options that are incorporated in QuantX.
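As context for what 3-bit quantization involves, here is a generic per-group quantize/dequantize baseline with 8 levels per group. QuantX's actual recipes and hardware-aware layouts are not public in this summary, so this shows only the standard mechanics such methods build on.

```python
def quantize_group(weights, bits=3):
    """Asymmetric per-group quantization: map floats to ints in [0, 2^bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1  # 7 for 3-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    """Reconstruct approximate floats from codes, scale, and zero point."""
    return [lo + qi * scale for qi in q]

group = [-0.12, 0.03, 0.40, -0.31, 0.22, 0.05, -0.07, 0.18]
q, scale, zero = quantize_group(group)
recon = dequantize_group(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(group, recon))
print(q)                  # 3-bit codes stored per group
print(round(max_err, 4))  # reconstruction error bounded by scale/2
```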
[200] Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark
Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He
Main category: cs.AI
TL;DR: The paper introduces Experience-driven Lifelong Learning (ELL), a framework for creating self-evolving AI agents that learn continuously through real-world interaction, built on four core principles: experience exploration, long-term memory, skill learning, and knowledge internalization.
Details
Motivation: As AI advances toward general intelligence, there's a need to shift from systems optimized for static tasks to creating open-ended agents that can learn continuously through real-world interaction and self-evolution.
Method: The ELL framework is based on four core principles: (1) Experience Exploration - continuous self-motivated interaction with dynamic environments, (2) Long-term Memory - preserving and structuring historical knowledge, (3) Skill Learning - abstracting patterns into reusable skills, and (4) Knowledge Internalization - converting explicit experiences into intuitive capabilities. The authors also introduce StuLife benchmark dataset simulating a student’s college journey.
Result: The paper presents a comprehensive framework for lifelong learning agents but does not provide specific experimental results in the abstract. The StuLife benchmark is introduced as an evaluation tool for the ELL framework.
Conclusion: The ELL framework represents a significant step toward building self-evolving AI agents capable of continuous growth through real-world interaction, with the StuLife benchmark providing a concrete testbed for evaluating lifelong learning capabilities in complex, dynamic environments.
Abstract: As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as “second nature”. We also introduce StuLife, a benchmark dataset for ELL that simulates a student’s holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm
[201] Oyster-I: Beyond Refusal – Constructive Safety Alignment for Responsible Language Models
Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue
Main category: cs.AI
TL;DR: CSA is a new safety paradigm that shifts from refusal-based responses to constructive guidance, especially for vulnerable users, achieving state-of-the-art safety while maintaining helpfulness.
Details
Motivation: Current LLM safety approaches focus too much on adversarial risks and simple refusals, which can worsen outcomes for non-malicious users in psychological distress who need constructive help.
Method: Constructive Safety Alignment (CSA) combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, implemented in the Oyster-I model.
Result: Oy1 achieves SOTA safety among open models, strong constructive engagement close to GPT-5, and unmatched robustness on jailbreak datasets nearing GPT-o1 levels.
Conclusion: CSA redefines model-user relationships by shifting from refusal-first to guidance-first safety, creating systems that are both safe and meaningfully helpful for all users.
Abstract: Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model’s response can strongly influence the user’s next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.
[202] ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models
Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhoseini, Farinaz Koushanfar
Main category: cs.AI
TL;DR: Proposes Truncated Cross Entropy (TCE) loss function to mitigate model collapse in generative AI by downweighting high-confidence predictions during recursive training on synthetic data.
Details
Motivation: Increasing reliance on generative AI means synthetic data could dominate training sets by 2030, causing model collapse, where performance degrades over generations of training on synthetic data.
Method: Identifies model overconfidence as a key driver of collapse and proposes a confidence-aware loss function (TCE) that downweights high-confidence predictions during training.
Result: TCE significantly delays model collapse, extending the model’s fidelity interval before collapse by more than 2.3x, and generalizes across different modalities.
Conclusion: Loss function design provides a simple yet powerful tool for preserving generative model quality in the era of increasing synthetic data usage.
Abstract: The increasing reliance on generative AI models has accelerated the generation rate of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030. This shift to a mainly synthetic content presents a critical challenge: repeated training in synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. Although prior studies have explored the causes and detection of model collapse, existing mitigation strategies remain limited. In this paper, we identify model overconfidence in their self-generated data as a key driver of collapse. Building on this observation, we propose a confidence-aware loss function that downweights high-confidence predictions during training. We introduce a novel loss function we call Truncated Cross Entropy (TCE). We demonstrate that TCE significantly delays model collapse in recursive training. We provide a model-agnostic framework that links the loss function design to model collapse mitigation and validate our approach both theoretically and empirically, showing that it can extend the model’s fidelity interval before collapse by more than 2.3x. Finally, we show that our method generalizes across modalities. These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models in the era of increasing synthetic data.
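The abstract names the mechanism (downweighting high-confidence predictions) but not TCE’s exact formula. As a rough sketch of the idea only, the loss below simply drops tokens whose target probability already exceeds a threshold; the threshold value and hard-masking rule are assumptions, not the paper’s definition:

```python
# Illustrative confidence-aware loss in the spirit of TCE (NOT the paper's
# exact formulation): tokens the model already predicts with probability
# above `threshold` contribute nothing to the loss.
import torch
import torch.nn.functional as F

def truncated_cross_entropy(logits, targets, threshold=0.9):
    log_probs = F.log_softmax(logits, dim=-1)
    target_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    confident = target_logp.exp() > threshold   # high-confidence predictions
    loss = (-target_logp).masked_fill(confident, 0.0)
    return loss.sum() / (~confident).sum().clamp(min=1)

logits = torch.randn(4, 10)                     # (batch, vocab)
targets = torch.randint(0, 10, (4,))
print(truncated_cross_entropy(logits, targets))
```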
[203] TORSO: Template-Oriented Reasoning Towards General Tasks
Minhyuk Kim, Seungyoon Lee, Heuiseok Lim
Main category: cs.AI
TL;DR: TORSO enables LLMs to use their internal reasoning capabilities without relying on manually crafted few-shot examples, achieving strong performance across diverse tasks.
Details
Motivation: Existing few-shot prompting approaches heavily depend on the provided examples, limiting the model’s inherent reasoning capabilities and requiring costly task-specific prompt construction.
Method: Template-Oriented Reasoning (TORSO) elicits models to utilize internal reasoning abilities to generate proper responses across various tasks without manually crafted few-shot examples.
Result: TORSO achieves strong performance on diverse LLM benchmarks with reasonable rationales.
Conclusion: TORSO provides an effective alternative to few-shot prompting that leverages models’ internal reasoning capabilities without task-specific manual example construction.
Abstract: The approaches that guide Large Language Models (LLMs) to emulate human reasoning during response generation have emerged as an effective method for enabling them to solve complex problems in a step-by-step manner, thereby achieving superior performance. However, most existing approaches using few-shot prompts to generate responses heavily depend on the provided examples, limiting the utilization of the model’s inherent reasoning capabilities. Moreover, constructing task-specific few-shot prompts is often costly and may lead to inconsistencies across different tasks. In this work, we introduce Template-Oriented Reasoning (TORSO), which elicits the model to utilize internal reasoning abilities to generate proper responses across various tasks without the need for manually crafted few-shot examples. Our experimental results demonstrate that TORSO achieves strong performance on diverse LLM benchmarks with reasonable rationales.
cs.SD
[204] VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
Main category: cs.SD
TL;DR: The paper introduces Voice Style Adaptation (VSA) task and VStyle benchmark to evaluate spoken language models’ ability to adapt speaking style based on spoken instructions, revealing current limitations and providing evaluation tools.
Details
Motivation: While spoken language models have advanced in semantic accuracy and instruction following, their ability to adapt speaking style (timbre, prosody, persona) based on spoken commands has received limited attention, creating a gap in natural human-machine interaction.
Method: The authors introduce VStyle, a bilingual (Chinese & English) benchmark covering four speech generation categories, and develop the LALM as a Judge framework for progressive evaluation along textual faithfulness, style adherence, and naturalness dimensions.
Result: Experiments on commercial systems and open source SLMs show that current models face clear limitations in controllable style adaptation, demonstrating both the novelty and challenge of the VSA task.
Conclusion: The VStyle benchmark and evaluation toolkit provide a foundation for advancing human-centered spoken interaction, highlighting the need for improved style adaptation capabilities in spoken language models.
Abstract: Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project’s homepage: https://junzhan2000.github.io/VStyle.github.io/.
[205] Testing chatbots on the creation of encoders for audio conditioned image generation
Jorge E. León, Miguel Carrasco
Main category: cs.SD
TL;DR: Chatbots were prompted to design audio encoders to replace CLIP text encoder in Stable Diffusion 1.5 for image generation from sound, but none achieved satisfactory results despite valid architectures.
Details
Motivation: To explore whether state-of-the-art conversational agents can design effective audio encoders to enable image synthesis directly from sound, leveraging their coding capabilities.
Method: Prompted five publicly available chatbots to propose neural architectures for audio encoders, trained each valid suggestion on over 2M audio-image-text observations, and evaluated on validation/test sets with various metrics and qualitative analysis.
Result: Although chatbots generated valid model designs, none achieved satisfactory results - audio embeddings failed to align reliably with original text encoder. Gemini showed best quantitative metrics, Grok produced more coherent images when paired with text encoder.
Conclusion: Findings reveal shared architectural bias across chatbots and underscore remaining coding gap. Researchers should perform more specialized tasks to fully test chatbots’ creativity and reasoning beyond well-known solutions.
Abstract: On one hand, recent advances in chatbots have led to a rising popularity in using these models for coding tasks. On the other hand, modern generative image models primarily rely on text encoders to translate semantic concepts into visual representations, even when there is clear evidence that audio can be employed as input as well. Given this, in this work we explore whether state-of-the-art conversational agents can design effective audio encoders to replace the CLIP text encoder from Stable Diffusion 1.5, enabling image synthesis directly from sound. We prompted five publicly available chatbots to propose neural architectures to work as these audio encoders, with a set of well-explained shared conditions. Each valid suggested encoder was trained on over two million context-related audio-image-text observations, and evaluated on held-out validation and test sets using various metrics, together with a qualitative analysis of their generated images. Although almost all chatbots generated valid model designs, none achieved satisfactory results, indicating that their audio embeddings failed to align reliably with those of the original text encoder. Among the proposals, the Gemini audio encoder showed the best quantitative metrics, while the Grok audio encoder produced more coherent images (particularly when paired with the text encoder). Our findings reveal a shared architectural bias across chatbots and underscore the remaining coding gap that needs to be bridged in future versions of these models. We also created a public demo so everyone can study and try out these audio encoders. Finally, we propose research questions that should be tackled in the future, and encourage other researchers to perform more focused and highly specialized tasks like this one, so the respective chatbots cannot make use of well-known solutions and their creativity/reasoning is fully tested.
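For intuition, here is a minimal sketch of the objective the chatbots were asked to satisfy: an audio encoder regressed onto the frozen CLIP text encoder’s sequence output so that it can be dropped into Stable Diffusion 1.5 in the text encoder’s place. The architecture and MSE objective below are illustrative assumptions, not any chatbot’s actual proposal:

```python
# Hypothetical audio encoder trained to mimic CLIP text embeddings
# (layer sizes, pooling to 77 "tokens", and the MSE loss are assumptions).
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, n_mels=80, seq_len=77, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, 3, padding=1), nn.GELU(),
            nn.Conv1d(256, dim, 3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool1d(seq_len)  # match CLIP's 77 positions

    def forward(self, mel):                  # mel: (B, n_mels, T)
        return self.pool(self.conv(mel)).transpose(1, 2)  # (B, 77, dim)

enc = AudioEncoder()
mel = torch.randn(2, 80, 400)               # paired audio features
clip_text_emb = torch.randn(2, 77, 768)     # frozen CLIP output (stand-in)
loss = nn.functional.mse_loss(enc(mel), clip_text_emb)  # alignment objective
```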
[206] AI-enabled tuberculosis screening in a high-burden setting using cough sound analysis and speech foundation models
Ning Ma, Bahman Mirheidari, Guy J. Brown, Minyoi M. Maimbolwa, Nsala Sanjase, Solomon Chifwamba, Seke Muzazu, Monde Muyoyeta, Mary Kagujje
Main category: cs.SD
TL;DR: AI cough analysis using deep learning and speech foundation models shows strong potential for TB screening, achieving 92.1% AUROC when combined with demographic/clinical data, meeting WHO benchmarks.
Details
Motivation: TB screening in high-burden, low-resource settings needs scalable solutions. Previous AI cough analysis studies were limited by small datasets, under-representation of non-TB symptomatic patients, simple models, and ideal recording conditions.
Method: Collected cough recordings from 500 participants in Zambia (TB+, other respiratory diseases, healthy controls). Trained deep learning classifiers based on speech foundation models on 3-second cough segments, then enhanced with demographic and clinical features.
Result: Audio-only classifier achieved 85.2% AUROC (TB+/Rest) and 80.1% (TB+/OR). Multimodal model with additional features improved to 92.1% and 84.2% respectively, with 90.3% sensitivity and 73.1% specificity at optimal threshold.
Conclusion: Cough analysis with speech foundation models, especially combined with demographic/clinical data, shows strong potential as a TB triage tool meeting WHO standards. Model is robust to confounding factors but requires further validation across diverse regions and case definitions before clinical use.
Abstract: Background Artificial intelligence (AI) can detect disease-related acoustic patterns in cough sounds, offering a scalable approach to tuberculosis (TB) screening in high-burden, low-resource settings. Previous studies have been limited by small datasets, under-representation of symptomatic non-TB patients, reliance on simple models, and recordings collected under idealised conditions. Methods We enrolled 512 participants at two hospitals in Zambia, grouped as bacteriologically confirmed TB (TB+), symptomatic patients with other respiratory diseases (OR), and healthy controls (HC). Usable cough recordings plus demographic and clinical data were obtained from 500 participants. Deep learning classifiers based on speech foundation models were trained on cough recordings. The best-performing model, trained on 3-second segments, was further evaluated with demographic and clinical features. Findings The best audio-only classifier achieved an AUROC of 85.2% for distinguishing TB+ from all others (TB+/Rest) and 80.1% for TB+ versus OR. Adding demographic and clinical features improved performance to 92.1% (TB+/Rest) and 84.2% (TB+/OR). At a threshold of 0.38, the multimodal model reached 90.3% sensitivity and 73.1% specificity for TB+/Rest, and 80.6% and 73.1% for TB+/OR. Interpretation Cough analysis using speech foundation models, especially when combined with demographic and clinical data, showed strong potential as a TB triage tool, meeting WHO target product profile benchmarks. The model was robust to confounding factors including background noise, recording time, and device variability, indicating detection of genuine disease-related acoustic patterns. Further validation across diverse regions and case definitions, including subclinical TB, is required before clinical use.
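A minimal sketch of the multimodal variant, assuming mean-pooled speech-foundation-model features fused with demographic/clinical covariates by simple concatenation; the paper’s actual fusion architecture and feature set are not specified in the abstract:

```python
# Assumed late-fusion classifier: pooled audio embedding + clinical features.
import torch
import torch.nn as nn

class CoughTBClassifier(nn.Module):
    def __init__(self, audio_dim=768, clinical_dim=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + clinical_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),               # logit for TB+ vs. rest
        )

    def forward(self, audio_emb, clinical):
        return self.head(torch.cat([audio_emb, clinical], dim=-1))

model = CoughTBClassifier()
audio_emb = torch.randn(4, 768)  # pooled features of a 3-second cough segment
clinical = torch.randn(4, 8)     # age, sex, symptoms, ... (placeholder)
prob_tb = torch.sigmoid(model(audio_emb, clinical))
```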
[207] DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
Yanru Huo, Ziyue Jiang, Zuoli Tang, Qingyang Hong, Zhou Zhao
Main category: cs.SD
TL;DR: DiTReducio is a training-free acceleration framework that reduces computational demands in Diffusion Transformer TTS models through temporal and branch skipping compression methods, achieving 75.4% FLOPs reduction and 37.1% RTF improvement while maintaining quality.
Details
Motivation: Diffusion Transformers (DiT) for speech synthesis have high computational demands that limit their practical deployment. Existing acceleration approaches focus on reducing sampling steps through distillation but remain constrained by training costs.
Method: Proposed the DiTReducio framework with two compression methods, Temporal Skipping and Branch Skipping, to eliminate redundant computations during inference. Uses a pattern-guided strategy based on characteristic attention patterns in DiT layers to selectively apply compression with adjustable thresholds.
Result: Achieves 75.4% reduction in FLOPs and 37.1% improvement in Real-Time Factor (RTF) while preserving generation quality, as demonstrated on F5-TTS and MegaTTS 3 models.
Conclusion: DiTReducio provides an effective training-free acceleration solution for DiT-based TTS models, enabling flexible trade-off between computational efficiency and generation quality without additional training costs.
Abstract: While Diffusion Transformers (DiT) have advanced non-autoregressive (NAR) speech synthesis, their high computational demands remain a limitation. Existing DiT-based text-to-speech (TTS) model acceleration approaches mainly focus on reducing sampling steps through distillation techniques, yet they remain constrained by training costs. We introduce DiTReducio, a training-free acceleration framework that compresses computations in DiT-based TTS models via progressive calibration. We propose two compression methods, Temporal Skipping and Branch Skipping, to eliminate redundant computations during inference. Moreover, based on two characteristic attention patterns identified within DiT layers, we devise a pattern-guided strategy to selectively apply the compression methods. Our method allows flexible modulation between generation quality and computational efficiency through adjustable compression thresholds. Experimental evaluations conducted on F5-TTS and MegaTTS 3 demonstrate that DiTReducio achieves a 75.4% reduction in FLOPs and improves the Real-Time Factor (RTF) by 37.1%, while preserving generation quality.
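As a toy illustration of the temporal-skipping idea (DiTReducio’s calibrated, attention-pattern-guided criteria are more involved), the wrapper below reuses a block’s cached output whenever its input barely changed since the previous sampling step; the relative-change threshold is an assumption:

```python
# Toy training-free skipping: reuse cached output when the input is
# nearly unchanged between adjacent diffusion sampling steps.
import torch

class SkippableBlock(torch.nn.Module):
    def __init__(self, block, threshold=1e-2):
        super().__init__()
        self.block, self.threshold = block, threshold
        self.prev_in = self.prev_out = None

    def forward(self, x):
        if self.prev_in is not None:
            delta = (x - self.prev_in).norm() / self.prev_in.norm()
            if delta < self.threshold:       # input nearly unchanged
                return self.prev_out         # skip the computation
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out

block = SkippableBlock(torch.nn.Linear(16, 16))
x = torch.randn(1, 16)
y1 = block(x)
y2 = block(x + 1e-5)    # second call hits the cache and costs nothing
```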
[208] Combining Textual and Spectral Features for Robust Classification of Pilot Communications
Abdullah All Tanvir, Chenyu Huang, Moe Alahmad, Chuyang Yang, Xin Zhong
Main category: cs.SD
TL;DR: A dual-pipeline ML framework using both textual and spectral features from pilot radio communications achieves over 91% F1-score for classifying aircraft operational intent at non-towered airports.
Details
Motivation: Accurate estimation of aircraft operations is critical for airport management but challenging at non-towered facilities lacking surveillance infrastructure.
Method: Dual-pipeline ML framework using both textual (from automatic speech recognition) and spectral (Mel-spectrogram) features from pilot radio communications. Evaluated traditional classifiers, ensemble methods, LSTM, and CNN models.
Result: Spectral features combined with deep architectures consistently yield superior performance with F1-scores exceeding 91%. Data augmentation improves robustness to real-world audio variability.
Conclusion: The approach is scalable, cost-effective, deployable without additional infrastructure, and offers a practical solution for air traffic monitoring at general aviation airports.
Abstract: Accurate estimation of aircraft operations, such as takeoffs and landings, is critical for effective airport management, yet remains challenging, especially at non-towered facilities lacking dedicated surveillance infrastructure. This paper presents a novel dual pipeline machine learning framework that classifies pilot radio communications using both textual and spectral features. Audio data collected from a non-towered U.S. airport was annotated by certified pilots with operational intent labels and preprocessed through automatic speech recognition and Mel-spectrogram extraction. We evaluate a wide range of traditional classifiers and deep learning models, including ensemble methods, LSTM, and CNN across both pipelines. To our knowledge, this is the first system to classify operational aircraft intent using a dual-pipeline ML framework on real-world air traffic audio. Our results demonstrate that spectral features combined with deep architectures consistently yield superior classification performance, with F1-scores exceeding 91%. Data augmentation further improves robustness to real-world audio variability. The proposed approach is scalable, cost-effective, and deployable without additional infrastructure, offering a practical solution for air traffic monitoring at general aviation airports.
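A minimal sketch of the two pipelines, with TF-IDF text features and torchaudio Mel-spectrograms standing in for the paper’s feature extractors; the example transcripts, label scheme, and classifier choices are placeholders:

```python
# Textual pipeline: ASR transcript -> TF-IDF -> classical classifier.
import torch
import torchaudio
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

transcripts = ["cessna three four lima downwind runway two seven",
               "skyhawk five niner departing to the north"]
labels = [1, 0]                        # e.g. arrival vs. departure (assumed)
vec = TfidfVectorizer()
text_clf = LogisticRegression().fit(vec.fit_transform(transcripts), labels)

# Spectral pipeline: waveform -> Mel-spectrogram -> input to a CNN branch.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
waveform = torch.randn(1, 16000 * 3)   # 3 s of placeholder radio audio
spec = mel(waveform)                   # (1, 64, frames), fed to a CNN
```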
[209] SoilSound: Smartphone-based Soil Moisture Estimation
Yixuan Gao, Tanvir Ahmed, Shuang He, Zhongqi Cheng, Rajalakshmi Nandakumar
Main category: cs.SD
TL;DR: SoilSound is a smartphone-based acoustic sensing system that measures soil moisture non-invasively using built-in speaker/microphone and a vertical scan mechanism with CNN processing, achieving 2.39% MAE across various soil types.
Details
Motivation: Existing soil moisture monitoring methods require invasive probes or specialized equipment, limiting public accessibility and disturbing the soil during measurement.
Method: Uses smartphone speakers to send acoustic chirps toward soil and records reflections during vertical scanning. Processes data with convolutional neural network for on-device moisture estimation based on surface roughness effect model.
Result: Achieves mean absolute error of 2.39% across 10 different locations, accurately tracking soil moisture levels from 15.9% to 34.0% across multiple soil types, environments, and users.
Conclusion: SoilSound enables widespread, non-invasive soil moisture monitoring without calibration or soil disturbance, making it accessible for home gardeners, urban farmers, and resource-limited agricultural communities.
Abstract: Soil moisture monitoring is essential for agriculture and environmental management, yet existing methods require either invasive probes disturbing the soil or specialized equipment, limiting access to the public. We present SoilSound, a ubiquitous, accessible smartphone-based acoustic sensing system that can measure soil moisture without disturbing the soil. We leverage the built-in speaker and microphone to perform a vertical scan mechanism to accurately measure moisture without any calibration. Unlike existing work that uses transmissive properties, we propose an alternate model for acoustic reflections in soil based on the surface roughness effect to enable moisture sensing without disturbing the soil. The system works by sending acoustic chirps towards the soil and recording the reflections during a vertical scan, which are then processed and fed to a convolutional neural network for on-device soil moisture estimation with negligible computational, memory, or power overhead. We evaluated the system by training with curated soils in boxes in the lab and testing in the outdoor fields and show that SoilSound achieves a mean absolute error (MAE) of 2.39% across 10 different locations. Overall, the evaluation shows that SoilSound can accurately track soil moisture levels ranging from 15.9% to 34.0% across multiple soil types, environments, and users; without requiring any calibration or disturbing the soil, enabling widespread moisture monitoring for home gardeners, urban farmers, citizen scientists, and agricultural communities in resource-limited settings.
[210] CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio
Marco Pasini, Stefan Lattner, George Fazekas
Main category: cs.SD
TL;DR: CoDiCodec is a novel audio autoencoder that efficiently compresses audio into both continuous embeddings (~11Hz) and discrete tokens (2.38kbps) from the same model, outperforming existing methods at similar bitrates.
Details
Motivation: Existing audio autoencoders force a choice between continuous embeddings and discrete tokens, and struggle to achieve high compression ratios while maintaining audio fidelity.
Method: Uses Finite Scalar Quantization (FSQ) and novel FSQ-dropout technique with a single consistency loss for end-to-end training. Supports both autoregressive and parallel decoding strategies.
Result: Outperforms existing continuous and discrete autoencoders at similar bitrates in reconstruction audio quality. Parallel decoding achieves superior quality and faster decoding.
Conclusion: Enables unified audio compression approach, bridging the gap between continuous and discrete generative modeling paradigms with unprecedented flexibility for downstream tasks.
Abstract: Efficiently representing audio signals in a compressed latent space is critical for latent generative modelling. However, existing autoencoders often force a choice between continuous embeddings and discrete tokens. Furthermore, achieving high compression ratios while maintaining audio fidelity remains a challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes these limitations by both efficiently encoding global features via summary embeddings, and by producing both compressed continuous embeddings at ~ 11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model, offering unprecedented flexibility for different downstream generative tasks. This is achieved through Finite Scalar Quantization (FSQ) and a novel FSQ-dropout technique, and does not require additional loss terms beyond the single consistency loss used for end-to-end training. CoDiCodec supports both autoregressive decoding and a novel parallel decoding strategy, with the latter achieving superior audio quality and faster decoding. CoDiCodec outperforms existing continuous and discrete autoencoders at similar bitrates in terms of reconstruction audio quality. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
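For reference, a minimal sketch of Finite Scalar Quantization with a straight-through estimator; the per-dimension level counts below are illustrative (odd counts, for simplicity), not CoDiCodec’s configuration:

```python
# FSQ: bound each latent dimension, round it to a small fixed grid, and pass
# gradients straight through the non-differentiable rounding.
import torch

def fsq(z, levels):
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z) * half                   # values in (-half, half)
    quantized = torch.round(bounded)                 # snap to the integer grid
    return bounded + (quantized - bounded).detach()  # straight-through trick

z = torch.randn(2, 6, requires_grad=True)              # (batch, latent dim)
levels = torch.tensor([7.0, 7.0, 7.0, 5.0, 5.0, 5.0])  # per-dim level counts
z_q = fsq(z, levels)
z_q.sum().backward()   # gradients flow as if no rounding had happened
```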
[211] Prototypical Contrastive Learning For Improved Few-Shot Audio Classification
Christos Sgouropoulos, Christos Nikou, Stefanos Vlachos, Vasileios Theiou, Christos Foukanelis, Theodoros Giannakopoulos
Main category: cs.SD
TL;DR: Integrating supervised contrastive loss with prototypical few-shot learning for audio classification achieves state-of-the-art performance on MetaAudio benchmark.
Details
Motivation: Few-shot learning in audio classification remains underexplored compared to the image domain, and there’s a need to improve performance with limited labeled data.
Method: Combines supervised contrastive loss (angular loss variant) with prototypical few-shot training, using SpecAugment and self-attention to create unified embeddings from augmented inputs.
Result: Achieves state-of-the-art performance in 5-way, 5-shot setting on MetaAudio benchmark with five datasets.
Conclusion: Angular contrastive loss improves few-shot audio classification performance, demonstrating the effectiveness of integrating contrastive learning with prototypical networks.
Abstract: Few-shot learning has emerged as a powerful paradigm for training models with limited labeled data, addressing challenges in scenarios where large-scale annotation is impractical. While extensive research has been conducted in the image domain, few-shot learning in audio classification remains relatively underexplored. In this work, we investigate the effect of integrating supervised contrastive loss into prototypical few-shot training for audio classification. In detail, we demonstrate that angular loss further improves the performance compared to the standard contrastive loss. Our method leverages SpecAugment followed by a self-attention mechanism to encapsulate diverse information of augmented input versions into one unified embedding. We evaluate our approach on MetaAudio, a benchmark including five datasets with predefined splits, standardized preprocessing, and a comprehensive set of few-shot learning models for comparison. The proposed approach achieves state-of-the-art performance in a 5-way, 5-shot setting.
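A minimal sketch of the prototypical episode loss for the 5-way, 5-shot setting; the angular contrastive term and the SpecAugment/self-attention embedding stage described above are omitted, and the random embeddings are placeholders:

```python
# Prototypical networks: class prototypes are support-set means; queries are
# classified by (negative) distance to each prototype.
import torch
import torch.nn.functional as F

def prototypical_loss(support, support_y, query, query_y, n_way):
    protos = torch.stack([support[support_y == c].mean(0)
                          for c in range(n_way)])
    dists = torch.cdist(query, protos)        # (n_query, n_way)
    return F.cross_entropy(-dists, query_y)   # nearer prototype = higher score

emb = torch.randn(25 + 15, 64)                # 5-way 5-shot + 15 queries
support, query = emb[:25], emb[25:]
support_y = torch.arange(5).repeat_interleave(5)
query_y = torch.randint(0, 5, (15,))
loss = prototypical_loss(support, support_y, query, query_y, n_way=5)
```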
[212] Data-independent Beamforming for End-to-end Multichannel Multi-speaker ASR
Can Cui, Paul Magron, Mostafa Sadeghi, Emmanuel Vincent
Main category: cs.SD
TL;DR: Beamforming approach using spherical polar coordinates improves multichannel multi-speaker ASR performance without training, reducing WER by up to 11% and improving speaker counting accuracy by up to 27% relative.
Details
Motivation: ASR in multichannel multi-speaker scenarios faces challenges from ambient noise, reverberation, and overlapping speakers, requiring better signal processing methods.
Method: Data-independent, training-free beamforming that processes specific angular sectors based on spherical polar coordinates before applying an end-to-end multichannel multi-speaker ASR system.
Result: Using beamformed signals improves ASR performance compared to raw microphone signals, and increasing the number of beamformed signals further enhances accuracy. On the AMI meeting corpus: up to 11% WER reduction and up to 27% relative improvement in speaker counting accuracy.
Conclusion: Proposed beamforming method enables more efficient use of multichannel signals while reducing input load for ASR systems, significantly improving recognition accuracy in challenging acoustic environments.
Abstract: Automatic speech recognition (ASR) in multichannel, multi-speaker scenarios remains challenging due to ambient noise, reverberation and overlapping speakers. In this paper, we propose a beamforming approach that processes specific angular sectors based on their spherical polar coordinates before applying an end-to-end multichannel, multi-speaker ASR system. This method is data-independent and training-free. We demonstrate that using a group of beamformed signals improves ASR performance compared to using the same number of raw microphone signals. Moreover, increasing the number of signals used for beamforming further enhances recognition accuracy, leading to a more efficient use of multichannel signals while reducing the overall input load for the ASR system. We conduct experiments on the AMI meeting corpus, where the proposed method reduces word error rate by up to 11% and improves speaker counting accuracy by up to 27% relative compared to a multichannel ASR baseline system that does not exploit beamforming.
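As a rough illustration of data-independent, training-free beamforming, here is a toy delay-and-sum beamformer steered toward one angle; the linear array geometry and far-field steering below are simplifying assumptions, not the paper’s exact design:

```python
# Toy delay-and-sum beamformer for one angular sector.
import numpy as np

def delay_and_sum(signals, mic_positions, angle, fs=16000, c=343.0):
    """signals: (n_mics, n_samples); steer toward `angle` (radians) in the
    horizontal plane by delaying each channel and averaging."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    delays = mic_positions @ direction / c    # seconds per microphone
    delays -= delays.min()                    # make all delays non-negative
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays):
        shift = int(round(d * fs))
        out[shift:] += sig[:len(sig) - shift] if shift else sig
    return out / len(signals)

mics = np.array([[0.0, 0.0], [0.05, 0.0], [0.10, 0.0]])  # linear array (m)
x = np.random.randn(3, 16000)                            # placeholder channels
beam = delay_and_sum(x, mics, angle=np.pi / 4)           # one steered signal
```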
[213] Improving Audio Event Recognition with Consistency Regularization
Shanmuka Sadhu, Weiran Wang
Main category: cs.SD
TL;DR: Consistency regularization improves audio event recognition on AudioSet, showing gains in both supervised and semi-supervised setups with different training set sizes.
Details
Motivation: To apply consistency regularization (CR) - which enforces prediction agreement on augmented views - to audio event recognition, building on its recent success in automatic speech recognition.
Method: Extensive ablation studies on AudioSet with small (~20k) and large (~1.8M) supervised training sets, using CR with data augmentation. Also extended to semi-supervised setup with 20K labeled and 1.8M unlabeled samples.
Result: CR brings consistent improvement over supervised baselines that already use heavy data augmentation. Stronger augmentation and multiple augmentations provide additional gains for small training sets. Semi-supervised setup with unlabeled data further improves performance.
Conclusion: Consistency regularization is effective for audio event recognition, working well across different dataset sizes and in both supervised and semi-supervised learning scenarios.
Abstract: Consistency regularization (CR), which enforces agreement between model predictions on augmented views, has found recent benefits in automatic speech recognition [1]. In this paper, we propose the use of consistency regularization for audio event recognition, and demonstrate its effectiveness on AudioSet. With extensive ablation studies for both small (~20k) and large (~1.8M) supervised training sets, we show that CR brings consistent improvement over supervised baselines which already heavily utilize data augmentation, and CR using stronger augmentation and multiple augmentations leads to additional gain for the small training set. Furthermore, we extend the use of CR into the semi-supervised setup with 20K labeled samples and 1.8M unlabeled samples, and obtain performance improvement over our best model trained on the small set.
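A minimal sketch of one consistency-regularized training step for multi-label audio tagging; the augmentation, the agreement measure (MSE between sigmoid outputs), and the loss weight are assumptions rather than the paper’s exact recipe:

```python
# Supervised loss on one augmented view + consistency between two views.
import torch
import torch.nn.functional as F

def cr_step(model, x, y, augment, cr_weight=1.0):
    logits1, logits2 = model(augment(x)), model(augment(x))
    sup = F.binary_cross_entropy_with_logits(logits1, y)  # multi-label tags
    cr = F.mse_loss(torch.sigmoid(logits1), torch.sigmoid(logits2))
    return sup + cr_weight * cr

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
x = torch.randn(8, 1, 64)                          # placeholder features
y = torch.randint(0, 2, (8, 10)).float()           # multi-hot labels
augment = lambda t: t + 0.1 * torch.randn_like(t)  # stand-in augmentation
loss = cr_step(model, x, y, augment)
```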
[214] Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework
Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato
Main category: cs.SD
TL;DR: LAVA is a hierarchical framework for audio deepfake detection and model attribution that uses attention-enhanced latent representations and specialized classifiers to identify both generation technology and specific model instances with high accuracy.
Details
Motivation: Audio deepfakes pose a growing threat to digital trust, and while detection methods exist, attributing deepfakes to their source models remains an underexplored but crucial challenge for accountability and forensic analysis.
Method: Hierarchical framework using a convolutional autoencoder trained on fake audio to extract attention-enhanced latent representations, with two specialized classifiers: Audio Deepfake Attribution (ADA) for generation technology identification and Audio Deepfake Model Recognition (ADMR) for specific model recognition, incorporating confidence-based rejection thresholds for open-set robustness.
Result: Strong performance across multiple datasets: ADA achieves F1-scores over 95% across all datasets, ADMR reaches 96.31% macro F1 across six classes, with confirmed robustness on unseen attacks from ASVspoof2019 LA and error propagation analysis.
Conclusion: LAVA advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks with publicly released models and code for reproducibility and further research.
Abstract: The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA’s robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at https://www.github.com/adipiz99/lava-framework.
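The open-set behaviour can be illustrated with a simple max-softmax rejection rule; the threshold below is a placeholder, not LAVA’s calibrated value:

```python
# Confidence-based rejection: low-confidence inputs are labelled "unknown".
import torch
import torch.nn.functional as F

def attribute_with_rejection(logits, threshold=0.7):
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    pred[conf < threshold] = -1    # -1 = reject as an unseen generator
    return pred

logits = torch.randn(5, 6)         # six known generator classes (as in ADMR)
print(attribute_with_rejection(logits))
```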
[215] Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
Harry Julian, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao
Main category: cs.SD
TL;DR: NeuCodec is a neural audio codec using Finite Scalar Quantization (FSQ) instead of traditional RVQ, showing superior robustness to noisy channels while maintaining comparable audio quality.
Details
Motivation: Existing neural audio codecs rely on Residual Vector Quantization (RVQ), but FSQ offers simpler training and native single codebook support. The paper aims to explore FSQ's potential for creating more robust audio codecs that perform well in noisy transmission scenarios.
Method: Developed NeuCodec, an FSQ-based neural audio codec. Conducted encoder distillation experiments to show different encoders can produce vastly different code sequences for identical audio. Compared RVQ and FSQ codecs by simulating transmission through noisy channels to test bit-level perturbation robustness.
Result: FSQ encodes baked-in redundancy that makes the encoding robust to noisy channels. Two different encoders learned to encode identical audio into different code sequences while maintaining comparable reconstruction quality. FSQ demonstrated vastly superior bit-level perturbation robustness compared to RVQ when transmitting through noisy channels.
Conclusion: FSQ-based neural audio codecs like NeuCodec offer significant advantages over traditional RVQ approaches, particularly in robustness to noisy transmission channels, while maintaining audio reconstruction quality and offering simpler training with single codebook support.
Abstract: Neural Audio Codecs (NACs) have become increasingly adopted in speech processing tasks due to their excellent rate-distortion performance and compatibility with Large Language Models (LLMs) as discrete feature representations for audio generation. While most existing codecs rely on Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has recently emerged as a compelling alternative that simplifies training and natively supports single codebooks. We introduce NeuCodec, an FSQ-based NAC, and show that FSQ encodes baked-in redundancy, producing encodings that remain robust when transmitted through noisy channels. First, through an encoder distillation experiment, we show that two different encoders can learn to encode identical audio into vastly different code sequences whilst maintaining comparable reconstruction quality with the same quantizer and decoder. Second, we demonstrate that FSQ has vastly superior bit-level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
[216] DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Ngoc-Son Nguyen, Hieu-Nghia Huynh-Nguyen, Thanh V. T. Tran, Truong-Son Hy, Van Nguyen
Main category: cs.SD
TL;DR: DiFlow-TTS is the first discrete flow matching model for zero-shot TTS that achieves fast inference (25.8x speedup) while maintaining high quality in naturalness, prosody, and speaker style preservation.
Details
Motivation: Existing zero-shot TTS methods suffer from slow inference speeds and repetition artifacts. Current flow-matching approaches embed discrete tokens into continuous space, failing to fully leverage the advantages of discrete representations for speech synthesis.
Method: DiFlow-TTS uses purely discrete flow matching with factorized speech attributes. It employs in-context learning by conditioning on text content and extracted prosodic/acoustic attributes from reference speech, with separate prediction heads for prosody and acoustic details.
Result: The model achieves promising performance in naturalness, prosody, speaker style preservation, and energy control. It maintains compact size and generates speech 25.8 times faster than latest baselines.
Conclusion: Discrete flow matching is an effective approach for zero-shot TTS that enables fast inference while maintaining high synthesis quality across multiple metrics.
Abstract: Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
cs.LG
[217] Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis
Mujie Liu, Chenze Wang, Liping Chen, Nguyen Linh Dan Le, Niharika Tewari, Ting Dang, Jiangang Ma, Feng Xia
Main category: cs.LG
TL;DR: SAM-BG is a two-stage self-supervised learning framework that preserves structural semantics in brain graphs for psychiatric diagnosis, using edge masking and structure-aware augmentation to achieve better performance with limited labeled data.
Details
Motivation: Limited labeled brain network data makes psychiatric diagnosis challenging, and existing SSL methods often disrupt crucial structural semantics in brain graphs through inappropriate augmentation strategies.
Method: Two-stage framework: 1) Pre-training stage trains an edge masker on small labeled subset to capture structural semantics; 2) SSL stage uses extracted structural priors to guide structure-aware augmentation for learning semantically meaningful representations.
Result: Outperforms state-of-the-art methods on two real-world psychiatric datasets, especially in small-labeled data settings, and uncovers clinically relevant connectivity patterns that enhance interpretability.
Conclusion: SAM-BG effectively preserves structural semantics in brain graphs, enabling more accurate psychiatric diagnosis with limited labeled data while maintaining interpretability through clinically relevant pattern discovery.
Abstract: The limited availability of labeled brain network data makes it challenging to achieve accurate and interpretable psychiatric diagnoses. While self-supervised learning (SSL) offers a promising solution, existing methods often rely on augmentation strategies that can disrupt crucial structural semantics in brain graphs. To address this, we propose SAM-BG, a two-stage framework for learning brain graph representations with structural semantic preservation. In the pre-training stage, an edge masker is trained on a small labeled subset to capture key structural semantics. In the SSL stage, the extracted structural priors guide a structure-aware augmentation process, enabling the model to learn more semantically meaningful and robust representations. Experiments on two real-world psychiatric datasets demonstrate that SAM-BG outperforms state-of-the-art methods, particularly in small-labeled data settings, and uncovers clinically relevant connectivity patterns that enhance interpretability. Our code is available at https://github.com/mjliu99/SAM-BG.
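A minimal sketch of a learnable edge masker over a brain-graph adjacency matrix; the sigmoid gating form is an assumed reading of the abstract, not SAM-BG’s published architecture:

```python
# Learnable soft mask over edges of a connectivity matrix.
import torch
import torch.nn as nn

class EdgeMasker(nn.Module):
    def __init__(self, n_nodes):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_nodes, n_nodes))

    def forward(self, adj):
        return adj * torch.sigmoid(self.logits)  # softly keep/drop edges

adj = torch.rand(90, 90)       # e.g. a 90-region connectivity matrix
masked = EdgeMasker(90)(adj)   # trained on the small labeled subset
```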
[218] D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference
Leen Daher, Zhaobo Wang, Malcolm Mielle
Main category: cs.LG
TL;DR: D-CAT enables cross-modal transfer learning without requiring paired sensor data during inference, allowing single-sensor deployment while maintaining performance gains from multi-modal training.
Details
Motivation: Existing cross-modal transfer methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not feasible.
Method: Proposes Decoupled Cross-Attention Transfer (D-CAT) with self-attention for feature extraction and a novel cross-attention alignment loss to align modality-specific representations without coupling classification pipelines.
Result: Achieves up to 10% F1-score gains in in-distribution scenarios (video to IMU transfer) and improves target performance in out-of-distribution scenarios even with weaker source modalities.
Conclusion: D-CAT enables single-sensor inference with cross-modal knowledge, reducing hardware redundancy while maintaining accuracy for cost-sensitive deployments like assistive robots.
Abstract: Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically and technically usable. To address this, we propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modality during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of sensors’ feature spaces without requiring the coupling of the classification pipelines of both modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to 10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model isn’t overfitted on the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at https://github.com/Schindler-EPFL-Lab/D-CAT.
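A minimal sketch of a cross-attention alignment step between two modality encoders; the abstract only states that a cross-attention alignment loss aligns the sensors’ feature spaces, so the MSE form below is an assumption:

```python
# Source (IMU) features attend to target (video) features; the attended
# result is pulled toward the target representation. At inference, only the
# source branch needs to be deployed.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=1, batch_first=True)
imu_feats = torch.randn(4, 50, 128)      # source modality sequence
video_feats = torch.randn(4, 50, 128)    # target modality sequence

aligned, _ = attn(imu_feats, video_feats, video_feats)  # cross-attention
align_loss = nn.functional.mse_loss(aligned, video_feats)
```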
[219] Meta-Learning Reinforcement Learning for Crypto-Return Prediction
Junqiao Wang, Zhaoyang Guan, Guanyu Liu, Tianze Xia, Xianzhi Li, Shuo Yin, Xinyuan Song, Chuhan Cheng, Tianyu Shi, Alex Lee
Main category: cs.LG
TL;DR: Meta-RL-Crypto is a transformer-based architecture combining meta-learning and reinforcement learning to create a self-improving cryptocurrency trading agent that outperforms other LLM-based methods without human supervision.
Details
Motivation: Cryptocurrency return prediction is challenging due to fast-shifting market factors, scarce labeled data, and expensive training requirements, necessitating an automated approach.
Method: Uses a unified transformer architecture with meta-learning and RL, featuring a closed-loop system where an instruction-tuned LLM alternates between actor, judge, and meta-judge roles using multimodal market inputs and internal preference feedback.
Result: The system demonstrates good performance on real-market technical indicators and outperforms other LLM-based baselines across diverse market regimes.
Conclusion: Meta-RL-Crypto provides an effective self-supervised framework for cryptocurrency trading that continuously refines both trading policies and evaluation criteria without additional human input.
Abstract: Predicting cryptocurrency returns is notoriously difficult: price movements are driven by a fast-shifting blend of on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. In this paper, we present Meta-RL-Crypto, a transformer-based architecture that unifies meta-learning and reinforcement learning (RL) to create a fully self-improving trading agent. Starting from a vanilla instruction-tuned LLM, the agent iteratively alternates between three roles (actor, judge, and meta-judge) in a closed-loop architecture. This learning process requires no additional human supervision. It can leverage multimodal market inputs and internal preference feedback. The agent in the system continuously refines both the trading policy and evaluation criteria. Experiments across diverse market regimes demonstrate that Meta-RL-Crypto performs well on the technical indicators of the real market and outperforms other LLM-based baselines.
[220] LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu
Main category: cs.LG
TL;DR: LAVa is a unified KV cache compression framework that minimizes information loss in Transformer residual streams, enabling dynamic budget allocation across layers and heads without training.
Details
Motivation: Existing KV cache compression methods are heuristic and lack dynamic budget allocation, leading to inefficient memory usage for long-context LLM inference.
Method: Analyzes layer attention output loss to derive a metric for comparing cache entries across heads, enabling layer-wise compression with dynamic head budgets and cross-layer information comparison for dynamic layer budgets.
Result: Superior performance on benchmarks (LongBench, Needle-In-A-Haystack, Ruler, InfiniteBench) with new insights: dynamic layer budgets crucial for generation tasks, dynamic head budgets key for extraction tasks.
Conclusion: LAVa is the first unified strategy for cache eviction and dynamic budget allocation that maintains top performance across task types without relying on training or multiple strategies.
Abstract: KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava.
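A toy illustration of score-based cache eviction under a fixed budget; the scoring function here is a placeholder (e.g. accumulated attention mass), not LAVa’s residual-stream information-loss metric, and real implementations operate per layer and per head:

```python
# Keep the top-`budget` highest-scoring KV cache entries, drop the rest.
import torch

def evict_kv(keys, values, scores, budget):
    """keys/values: (seq, dim); scores: (seq,) importance per cache entry."""
    keep = scores.topk(min(budget, scores.numel())).indices.sort().values
    return keys[keep], values[keep]

seq, dim = 1024, 64
keys, values = torch.randn(seq, dim), torch.randn(seq, dim)
scores = torch.rand(seq)            # placeholder importance scores
k_small, v_small = evict_kv(keys, values, scores, budget=256)
```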
[221] The Overcooked Generalisation Challenge: Evaluating Cooperation with Novel Partners in Unknown Environments Using Unsupervised Environment Design
Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, Andreas Bulling
Main category: cs.LG
TL;DR: The Overcooked Generalisation Challenge (OGC) is a new benchmark for evaluating RL agents’ ability to cooperate with unknown partners in unfamiliar environments, featuring dual curriculum design and GPU acceleration.
Details
Motivation: Existing cooperative RL evaluations are limited to training environments and partners, failing to assess true generalization capacity needed for human collaboration.
Method: Extends Overcooked-AI with dual curriculum design (DCD), creating a rich design space with full kitchen layouts and multiple objects that require accounting for agent interaction dynamics.
Result: Current state-of-the-art DCD algorithms and neural architectures fail to produce agents that effectively generalize to novel layouts and unfamiliar partners.
Conclusion: OGC establishes a demanding testbed for cooperative generalization, highlighting that both agents and curriculum designers struggle with joint partner and environment generalization challenges.
Abstract: We introduce the Overcooked Generalisation Challenge (OGC) - a new benchmark for evaluating reinforcement learning (RL) agents on their ability to cooperate with unknown partners in unfamiliar environments. Existing work has typically evaluated cooperative RL only in the training environment or with training partners, seriously limiting our ability to understand agents’ generalisation capacity - an essential requirement for future collaboration with humans. The OGC extends Overcooked-AI to support dual curriculum design (DCD). It is fully GPU-accelerated, open-source, and integrated into the minimax DCD benchmark suite. Compared to prior DCD benchmarks, where designers manipulate only minimal elements of the environment, OGC introduces a significantly richer design space: full kitchen layouts with multiple objects that require the designer to account for interaction dynamics between agents. We evaluate state-of-the-art DCD algorithms alongside scalable neural architectures and find that current methods fail to produce agents that generalise effectively to novel layouts and unfamiliar partners. Our results indicate that both agents and curriculum designers struggle with the joint challenge of partner and environment generalisation. These findings establish OGC as a demanding testbed for cooperative generalisation and highlight key directions for future research. We open-source our code.
[222] Hybrid Adaptive Conformal Offline Reinforcement Learning for Fair Population Health Management
Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, Rajaie Batniji
Main category: cs.LG
TL;DR: HACO framework combines conformal prediction with offline RL to provide safe, auditable decision support for Medicaid population health management, achieving strong risk discrimination while maintaining fairness across demographic subgroups.
Details
Motivation: Population health management programs for Medicaid populations need safe, fair, and auditable coordination of outreach and services while controlling near-term risk of adverse utilization events like emergency department visits.
Method: Hybrid Adaptive Conformal Offline Reinforcement Learning (HACO) framework that separates risk calibration from preference optimization: it trains a lightweight risk model, derives a conformal threshold to mask unsafe actions, and learns a preference policy on the resulting safe subset.
Result: Achieved strong risk discrimination (AUC ~0.81) with calibrated threshold, maintained high safe coverage, and revealed systematic differences in estimated value across demographic subgroups through fairness auditing.
Conclusion: Conformal risk gating integrates cleanly with offline RL to deliver conservative, auditable decision support for population health management teams, demonstrating importance of subgroup fairness analysis.
Abstract: Population health management programs for Medicaid populations coordinate longitudinal outreach and services (e.g., benefits navigation, behavioral health, social needs support, and clinical scheduling) and must be safe, fair, and auditable. We present a Hybrid Adaptive Conformal Offline Reinforcement Learning (HACO) framework that separates risk calibration from preference optimization to generate conservative action recommendations at scale. In our setting, each step involves choosing among common coordination actions (e.g., which member to contact, by which modality, and whether to route to a specialized service) while controlling the near-term risk of adverse utilization events (e.g., unplanned emergency department visits or hospitalizations). Using a de-identified operational dataset from Waymark comprising 2.77 million sequential decisions across 168,126 patients, HACO (i) trains a lightweight risk model for adverse events, (ii) derives a conformal threshold to mask unsafe actions at a target risk level, and (iii) learns a preference policy on the resulting safe subset. We evaluate policies with a version-agnostic fitted Q evaluation (FQE) on stratified subsets and audit subgroup performance across age, sex, and race. HACO achieves strong risk discrimination (AUC $\approx 0.81$) with a calibrated threshold ($\tau \approx 0.038$ at $\alpha = 0.10$), while maintaining high safe coverage. Subgroup analyses reveal systematic differences in estimated value across demographics, underscoring the importance of fairness auditing. Our results show that conformal risk gating integrates cleanly with offline RL to deliver conservative, auditable decision support for population health management teams.
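The gating step lends itself to a compact illustration. Below is a minimal Python sketch of conformal risk gating, assuming a generic risk model and calibration set; the helper names and the exact quantile rule are illustrative, not the authors' code.

```python
import numpy as np

def conformal_threshold(cal_risk, alpha=0.10):
    """Pick tau as a finite-sample-corrected (1 - alpha) quantile of
    predicted risks on a held-out calibration set."""
    n = len(cal_risk)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_risk, q))

def safe_actions(risk_model, state, actions, tau):
    """Mask actions whose predicted adverse-event risk exceeds tau; the
    preference policy is then learned/evaluated only on this subset."""
    return [a for a in actions if risk_model(state, a) <= tau]
```

Separating the calibration of tau from the preference policy is what keeps the recommendations conservative and auditable: the safety margin can be reported independently of the learned policy.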
[223] One Head, Many Models: Cross-Attention Routing for Cost-Aware LLM Selection
Roshini Pulishetty, Mani Kishan Ghantasala, Keerthy Kaushik Dasoju, Niti Mangwani, Vishal Garimella, Aditya Mate, Somya Chatterjee, Yue Kang, Ehi Nosakhare, Sadid Hasan, Soundar Srinivasan
Main category: cs.LG
TL;DR: A unified routing framework using cross-attention to dynamically select optimal LLMs per query, achieving 6.6% AIQ improvement and better cost-performance balance.
Details
Motivation: Address the challenge of scalable, cost-effective deployment of diverse LLMs with varying computational costs and performance profiles in real-world applications.
Method: Single-head cross-attention mechanism to jointly model query and model embeddings, predicting both response quality and generation cost. Uses an exponential reward function for stable performance-cost balancing (see the sketch below).
Result: Achieves up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers on RouterBench benchmark.
Conclusion: Lightweight architecture generalizes effectively across domains, establishes new standard for cost-aware LLM routing with improved efficiency over prior methods.
Abstract: The proliferation of large language models (LLMs) with varying computational costs and performance profiles presents a critical challenge for scalable, cost-effective deployment in real-world applications. We introduce a unified routing framework that leverages a single-head cross-attention mechanism to jointly model query and model embeddings, enabling dynamic selection of the optimal LLM for each input query. Our approach is evaluated on RouterBench, a large-scale, publicly available benchmark encompassing diverse LLM pools and domains. By explicitly capturing fine-grained query-model interactions, our router predicts both response quality and generation cost, achieving up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers. To robustly balance performance and cost, we propose an exponential reward function that enhances stability across user preferences. The resulting architecture is lightweight, generalizes effectively across domains, and demonstrates improved efficiency compared to prior methods, establishing a new standard for cost-aware LLM routing.
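As a rough illustration of the routing mechanism, here is a minimal NumPy sketch of single-head cross-attention between a query embedding and a pool of model embeddings, with two linear heads for quality and cost. The random weights are stand-ins, and the exponential reward form is an assumption rather than the paper's exact formula.

```python
import numpy as np

d = 64                                                 # embedding dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
w_quality, w_cost = rng.normal(size=d), rng.normal(size=d)

def route(q, M, lam=1.0):
    """q: (d,) query embedding; M: (n_models, d) model embeddings.
    Returns the index of the model with the best predicted reward."""
    scores = (W_q @ q) @ (M @ W_k.T).T / np.sqrt(d)    # (n_models,)
    attn = np.exp(scores - scores.max()); attn /= attn.sum()
    feats = attn[:, None] * (M @ W_v.T)                # query-model interaction
    quality, cost = feats @ w_quality, feats @ w_cost
    return int(np.argmax(quality * np.exp(-lam * cost)))  # hypothetical reward

print(route(rng.normal(size=d), rng.normal(size=(5, d))))
```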
[224] From the Gradient-Step Denoiser to the Proximal Denoiser and their associated convergent Plug-and-Play algorithms
Vincent Herfeld, Baudouin Denis de Senneville, Arthur Leclaire, Nicolas Papadakis
Main category: cs.LG
TL;DR: Analysis of Gradient-Step Denoiser in Plug-and-Play algorithms, showing it can serve as exact gradient descent/proximity operators while maintaining state-of-the-art denoising performance.
Details
Motivation: Plug-and-Play optimization algorithms typically use off-the-shelf denoisers as implicit image priors, but these lack explicit functional representations. The paper aims to develop a denoiser that can serve as an exact mathematical operator while preserving denoising capabilities.
Method: The Gradient-Step Denoiser is trained to function as either the gradient descent operator or the proximity operator of an explicit functional, providing mathematical rigor while maintaining practical denoising performance (see the sketch below).
Result: The proposed denoiser successfully serves as exact mathematical operators (gradient descent or proximity operators) while achieving state-of-the-art denoising capabilities.
Conclusion: The Gradient-Step Denoiser bridges the gap between theoretical optimization operators and practical denoising performance, enabling more mathematically grounded Plug-and-Play algorithms with explicit functional representations.
Abstract: In this paper we analyze the Gradient-Step Denoiser and its usage in Plug-and-Play algorithms. The Plug-and-Play paradigm of optimization algorithms uses off-the-shelf denoisers to replace a proximity operator or a gradient descent operator of an image prior. Usually this image prior is implicit and cannot be expressed, but the Gradient-Step Denoiser is trained to be exactly the gradient descent operator or the proximity operator of an explicit functional while preserving state-of-the-art denoising capabilities.
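The defining construction is simple enough to state in code. A minimal PyTorch sketch, assuming an arbitrary smooth potential network g in place of the paper's trained architecture: the denoiser is literally one gradient step on an explicit functional, so the prior is no longer implicit.

```python
import torch

g = torch.nn.Sequential(                      # stand-in potential g_theta
    torch.nn.Linear(64, 128), torch.nn.SiLU(), torch.nn.Linear(128, 1))

def gradient_step_denoiser(x):
    """D(x) = x - grad g(x): an exact gradient-descent operator of g."""
    x = x.requires_grad_(True)
    grad = torch.autograd.grad(g(x).sum(), x, create_graph=True)[0]
    return x - grad

denoised = gradient_step_denoiser(torch.randn(8, 64))  # toy 64-dim signals
```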
[225] Distinguishing Startle from Surprise Events Based on Physiological Signals
Mansi Sharma, Alexandre Duchevet, Florian Daiber, Jean-Paul Imbert, Maurice Rekrut
Main category: cs.LG
TL;DR: Machine learning approach using physiological signals to distinguish between startle and surprise reactions in pilots, achieving 85.7% accuracy with SVM and Late Fusion, and 74.9% accuracy when including baseline states with XGBoost.
Details
Motivation: Unexpected events like startle and surprise impair pilot attention and decision-making, posing safety risks in aviation. These reactions are hard to distinguish in practice and have been studied separately, with limited focus on combined effects or physiological differentiation.
Method: Used machine learning and multi-modal fusion strategies to analyze physiological signals for distinguishing between startle and surprise events. Evaluated SVM and XGBoost classifiers with a Late Fusion approach (see the sketch below).
Result: Achieved 85.7% mean accuracy distinguishing startle vs surprise using SVM with Late Fusion. Extended to three-class problem (Startle, Surprise, Baseline) achieved 74.9% accuracy with XGBoost and Late Fusion.
Conclusion: Physiological signals can reliably predict and differentiate between startle and surprise reactions using machine learning, providing a robust method for aviation safety applications to distinguish these critical cognitive states.
Abstract: Unexpected events can impair attention and delay decision-making, posing serious safety risks in high-risk environments such as aviation. In particular, reactions like startle and surprise can impact pilot performance in different ways, yet are often hard to distinguish in practice. Existing research has largely studied these reactions separately, with limited focus on their combined effects or how to differentiate them using physiological data. In this work, we address this gap by distinguishing between startle and surprise events based on physiological signals using machine learning and multi-modal fusion strategies. Our results demonstrate that these events can be reliably predicted, achieving a highest mean accuracy of 85.7% with SVM and Late Fusion. To further validate the robustness of our model, we extended the evaluation to include a baseline condition, successfully differentiating between Startle, Surprise, and Baseline states with a highest mean accuracy of 74.9% with XGBoost and Late Fusion.
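Late fusion, the best-performing strategy here, is straightforward to sketch: train one probabilistic classifier per physiological modality and average the class probabilities at decision time. The per-modality feature splits below are hypothetical placeholders for the actual signal channels.

```python
import numpy as np
from sklearn.svm import SVC

def late_fusion_fit(modalities, y):
    """modalities: list of (n_samples, n_features_m) arrays, one per signal."""
    return [SVC(probability=True).fit(X, y) for X in modalities]

def late_fusion_predict(models, modalities):
    # average per-modality class probabilities, then take the argmax
    probs = np.mean([m.predict_proba(X) for m, X in zip(models, modalities)],
                    axis=0)
    return probs.argmax(axis=1)
```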
[226] Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning
Reza Asad, Reza Babanezhad, Sharan Vaswani
Main category: cs.LG
TL;DR: The paper introduces a flexible off-policy actor-critic framework for discrete-action RL that improves upon DSAC by decoupling actor and critic entropy, achieving DQN-level performance on Atari games without requiring entropy regularization or explicit exploration.
Details
Motivation: Value-based methods like DQN dominate discrete-action RL, while policy-based methods either don't learn effectively from off-policy data (PPO) or perform poorly in discrete settings (SAC). The authors aim to improve actor-critic methods for discrete actions by addressing the poor performance of DSAC.
Method: The authors identify that coupling between actor and critic entropy causes DSAC’s poor performance. They propose decoupling these components and introduce a flexible framework that allows m-step Bellman updates for critics and combines policy optimization with entropy regularization for actors (see the sketch below).
Result: The proposed methods achieve comparable performance to DQN on standard Atari games, even without entropy regularization or explicit exploration. Theoretical analysis shows convergence to optimal regularized value functions in tabular settings.
Conclusion: Decoupling actor and critic entropy is crucial for effective discrete-action actor-critic methods. The proposed flexible framework enables competitive performance with DQN while maintaining theoretical guarantees, making actor-critic approaches viable for discrete-action off-policy RL.
Abstract: Value-based approaches such as DQN are the default methods for off-policy reinforcement learning with discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not effectively learn from off-policy data (e.g. PPO), or have poor empirical performance in the discrete-action setting (e.g. SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that by merely decoupling these components, DSAC can have comparable performance as DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods can guarantee convergence to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
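A minimal sketch of the decoupling insight, under my own parameterization (the paper's exact objectives may differ): give the critic and the actor separate entropy coefficients, so that zeroing the critic's coefficient yields a DQN-like backup while the actor keeps its own regularization.

```python
import torch

def critic_target(r, done, q_next, pi_next, gamma=0.99, alpha_critic=0.0):
    """Soft value target with the critic's own entropy coefficient;
    alpha_critic = 0 recovers a hard (DQN-style) backup."""
    v_next = (pi_next * (q_next - alpha_critic * torch.log(pi_next + 1e-8))).sum(-1)
    return r + gamma * (1.0 - done) * v_next

def actor_loss(q, pi, alpha_actor=0.01):
    """Entropy-regularized policy objective with a separate coefficient."""
    return (pi * (alpha_actor * torch.log(pi + 1e-8) - q)).sum(-1).mean()
```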
[227] Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge
Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat
Main category: cs.LG
TL;DR: Training-free adaptive token merging framework that compresses transformer representations by merging redundant tokens based on per-layer similarity thresholds, achieving significant computational and communication savings while maintaining accuracy.
Details
Motivation: Large transformers are computationally expensive and hinder deployment on resource-constrained edge devices, requiring efficient compression methods that don't sacrifice performance.
Method: Uses adaptive token merging with per-layer similarity thresholds to selectively merge semantically redundant tokens, framed as multi-objective optimization solved via Bayesian optimization for Pareto-optimal trade-offs (see the sketch below).
Result: Achieves 30% fewer FLOPs and under 20% communication cost on ImageNet while matching accuracy; for VQA, achieves competitive performance with LLaVA at <1/3 compute and <1/10 bandwidth; robust across channel conditions and provides privacy benefits against model inversion attacks.
Conclusion: Provides a practical, versatile solution for deploying transformers in resource-limited edge scenarios without retraining, balancing efficiency and task relevance through data-dependent adaptation.
Abstract: Large-scale transformers are central to modern semantic communication, yet their high computational and communication costs hinder deployment on resource-constrained edge devices. This paper introduces a training-free framework for adaptive token merging, a novel mechanism that compresses transformer representations at runtime by selectively merging semantically redundant tokens under per-layer similarity thresholds. Unlike prior fixed-ratio reduction, our approach couples merging directly to input redundancy, enabling data-dependent adaptation that balances efficiency and task relevance without retraining. We cast the discovery of merging strategies as a multi-objective optimization problem and leverage Bayesian optimization to obtain Pareto-optimal trade-offs between accuracy, inference cost, and communication cost. On ImageNet classification, we match the accuracy of the unmodified transformer with 30% fewer floating-point operations and under 20% of the original communication cost, while for visual question answering our method achieves performance competitive with the full LLaVA model at less than one-third of the compute and one-tenth of the bandwidth. Finally, we show that our adaptive merging is robust across varying channel conditions and provides inherent privacy benefits, substantially degrading the efficacy of model inversion attacks. Our framework provides a practical and versatile solution for deploying powerful transformer models in resource-limited edge intelligence scenarios.
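The core merging rule can be sketched in a few lines. A minimal greedy variant, assuming average-merging of adjacent tokens whose cosine similarity exceeds the layer threshold (the paper additionally tunes per-layer thresholds via Bayesian optimization, omitted here):

```python
import torch

def merge_tokens(x, threshold):
    """x: (n_tokens, d). Fold each token into the previously kept token
    when their cosine similarity exceeds the per-layer threshold."""
    kept, counts = [x[0]], [1]
    for t in x[1:]:
        if torch.cosine_similarity(kept[-1], t, dim=0) > threshold:
            kept[-1] = (kept[-1] * counts[-1] + t) / (counts[-1] + 1)  # running mean
            counts[-1] += 1
        else:
            kept.append(t); counts.append(1)
    return torch.stack(kept)

compressed = merge_tokens(torch.randn(196, 64), threshold=0.9)
```

Because the threshold acts on similarity rather than a fixed keep ratio, highly redundant inputs compress aggressively while information-dense inputs pass through nearly untouched.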
[228] HGEN: Heterogeneous Graph Ensemble Networks
Jiajun Shen, Yufei Jin, Yi He, Xingquan Zhu
Main category: cs.LG
TL;DR: HGEN introduces ensemble learning for heterogeneous graphs using meta-path optimization and random dropping to create diverse GNN learners, with residual-attention and correlation-regularization to improve accuracy and diversity.
Details
Motivation: Heterogeneity in node types, features, and neighborhood topology poses significant challenges for ensemble learning in graphs, requiring specialized approaches to accommodate diverse graph learners.
Method: HGEN ensembles multiple learners through meta-path and transformation-based optimization, using random dropping to create Allele GNNs. It employs a residual-attention mechanism to calibrate different meta-paths and correlation regularization to increase disparity among embedding matrices (see the sketch below).
Result: Experiments on five heterogeneous networks show HGEN consistently outperforms state-of-the-art competitors by substantial margins, with proven convergence and higher regularization effectiveness than simple voting.
Conclusion: HGEN successfully pioneers ensemble learning for heterogeneous graphs, demonstrating that meta-path optimization with specialized calibration mechanisms significantly improves classification accuracy over existing approaches.
Abstract: This paper presents HGEN that pioneers ensemble learning for heterogeneous graphs. We argue that the heterogeneity in node types, nodal features, and local neighborhood topology poses significant challenges for ensemble learning, particularly in accommodating diverse graph learners. Our HGEN framework ensembles multiple learners through a meta-path and transformation-based optimization pipeline to uplift classification accuracy. Specifically, HGEN uses meta-path combined with random dropping to create Allele Graph Neural Networks (GNNs), whereby the base graph learners are trained and aligned for later ensembling. To ensure effective ensemble learning, HGEN presents two key components: 1) a residual-attention mechanism to calibrate allele GNNs of different meta-paths, thereby enforcing node embeddings to focus on more informative graphs to improve base learner accuracy, and 2) a correlation-regularization term to enlarge the disparity among embedding matrices generated from different meta-paths, thereby enriching base learner diversity. We analyze the convergence of HGEN and attest its higher regularization magnitude over simple voting. Experiments on five heterogeneous networks validate that HGEN consistently outperforms its state-of-the-art competitors by substantial margins.
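The correlation-regularization idea admits a simple sketch; the exact penalty in HGEN may differ, but the intent is to penalize agreement between embedding matrices produced under different meta-paths so the base learners stay diverse.

```python
import torch
import torch.nn.functional as F

def diversity_penalty(embeddings):
    """embeddings: list of (n_nodes, d) matrices, one per meta-path.
    Adding this term to the training loss pushes pairwise node-embedding
    similarity across meta-paths toward zero, enlarging disparity."""
    loss = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            zi = F.normalize(embeddings[i], dim=1)
            zj = F.normalize(embeddings[j], dim=1)
            loss = loss + (zi * zj).sum(dim=1).pow(2).mean()
    return loss
```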
[229] Latency and Token-Aware Test-Time Compute
Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun
Main category: cs.LG
TL;DR: A framework for dynamic compute allocation and method selection during LLM inference that optimizes both token usage and latency, outperforming static strategies like best-of-N.
Details
Motivation: Existing inference-time scaling methods focus only on parallel generation and ignore latency, which is critical for user experience and agentic workflows requiring multiple efficient queries.
Method: Formulates inference-time scaling as a dynamic compute allocation problem, incorporating both token cost and wall-clock latency, with per-query strategy selection (see the sketch below).
Result: Experiments on reasoning benchmarks show consistent outperformance of static strategies, achieving better accuracy-cost trade-offs while remaining practical for deployment.
Conclusion: Dynamic compute allocation with explicit latency consideration provides superior performance and practical deployment advantages over traditional static inference strategies.
Abstract: Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.
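The per-query decision rule reduces to maximizing a utility that charges for both tokens and wall-clock time. A toy sketch with hypothetical numbers (the paper learns these quantities per query rather than assuming them):

```python
def select_strategy(predictions, lam_tokens=1e-4, lam_latency=0.05):
    """predictions: strategy name -> (accuracy, tokens, latency_seconds)."""
    def utility(stats):
        acc, tokens, latency = stats
        return acc - lam_tokens * tokens - lam_latency * latency
    return max(predictions, key=lambda s: utility(predictions[s]))

# Hypothetical per-query estimates: best-of-N is parallel but token-hungry,
# beam search decodes incrementally, a single sample is cheapest.
print(select_strategy({
    "best_of_8": (0.82, 4000, 6.0),
    "beam_4":    (0.78, 1800, 3.5),
    "single":    (0.70,  500, 1.2),
}))
```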
[230] Variational Neural Networks for Observable Thermodynamics (V-NOTS)
Christopher Eldred, François Gay-Balmaz, Vakhtang Putkaradze
Main category: cs.LG
TL;DR: A novel neural network framework using thermodynamic Lagrangian to predict dissipative dynamical systems from observable data only, without requiring unobservable momenta and entropies, while ensuring thermodynamic consistency.
Details
Motivation: Many physical systems have unobservable phase space variables (momenta, entropies), while available data contains only observable coordinates. This makes traditional data-based computing approaches challenging for dissipative dynamical systems.
Method: Developed a thermodynamic-Lagrangian-based neural network that works exclusively with observable variables, respects thermodynamics, and guarantees non-decreasing entropy evolution.
Result: The network efficiently describes phase space evolution using limited data points and a relatively small number of parameters, successfully predicting future solutions without direct observation of all phase space variables.
Conclusion: The proposed framework provides an effective data-based computing approach for dissipative systems that works with observable data only while maintaining thermodynamic consistency and prediction accuracy.
Abstract: Much attention has recently been devoted to data-based computing of evolution of physical systems. In such approaches, information about data points from past trajectories in phase space is used to reconstruct the equations of motion and to predict future solutions that have not been observed before. However, in many cases, the available data does not correspond to the variables that define the system’s phase space. We focus our attention on the important example of dissipative dynamical systems. In that case, the phase space consists of coordinates, momenta and entropies; however, the momenta and entropies cannot, in general, be observed directly. To address this difficulty, we develop an efficient data-based computing framework based exclusively on observable variables: we construct a novel approach based on the thermodynamic Lagrangian, and build neural networks that respect the thermodynamics and guarantee non-decreasing entropy evolution. We show that our network can provide an efficient description of phase space evolution based on a limited number of data points and a relatively small number of parameters in the system.
[231] LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios
Jiahao Chen, Zhiyuan Huang, Yurou Liu, Bing Su
Main category: cs.LG
TL;DR: LoFT framework extends long-tailed semi-supervised learning to foundation model fine-tuning, generating more reliable pseudo-labels and handling open-world scenarios with OOD samples.
Details
Motivation: Existing LTSSL methods train from scratch, leading to overconfidence and low-quality pseudo-labels. Foundation model fine-tuning can address these issues and handle more practical open-world scenarios.
Method: Proposes the LoFT framework, using parameter-efficient fine-tuning of foundation models for reliable pseudo-label generation, and the LoFT-OW extension for open-world scenarios with OOD samples.
Result: Superior performance on multiple benchmarks, achieving better results even when using only 1% of unlabeled data compared to previous approaches.
Conclusion: Fine-tuning foundation models provides more reliable pseudo-labels for long-tailed SSL and effectively handles open-world scenarios with OOD contamination.
Abstract: Long-tailed learning has garnered increasing attention due to its wide applicability in real-world scenarios. Among existing approaches, Long-Tailed Semi-Supervised Learning (LTSSL) has emerged as an effective solution by incorporating a large amount of unlabeled data into the imbalanced labeled dataset. However, most prior LTSSL methods are designed to train models from scratch, which often leads to issues such as overconfidence and low-quality pseudo-labels. To address these challenges, we extend LTSSL into the foundation model fine-tuning paradigm and propose a novel framework: LoFT (Long-tailed semi-supervised learning via parameter-efficient Fine-Tuning). We demonstrate that fine-tuned foundation models can generate more reliable pseudo-labels, thereby benefiting imbalanced learning. Furthermore, we explore a more practical setting by investigating semi-supervised learning under open-world conditions, where the unlabeled data may include out-of-distribution (OOD) samples. To handle this problem, we propose LoFT-OW (LoFT under Open-World scenarios) to improve the discriminative ability. Experimental results on multiple benchmarks demonstrate that our method achieves superior performance compared to previous approaches, even when utilizing only 1% of the unlabeled data used in previous works.
[232] Multi-Play Combinatorial Semi-Bandit Problem
Shintaro Nakamura, Yuko Kuroki, Wei Chen
Main category: cs.LG
TL;DR: The paper introduces Multi-Play Combinatorial Semi-Bandit (MP-CSB) to extend combinatorial bandits to non-negative integer action spaces, proposing two algorithms with logarithmic regret in stochastic settings and best-of-both-worlds performance.
Details
Motivation: Traditional combinatorial semi-bandit problems are limited to binary decision spaces, excluding important applications like optimal transport and knapsack problems that require non-negative integer flows or allocations.
Method: Proposes two algorithms: 1) a Thompson-sampling-based algorithm for exponentially large action spaces with O(log T) regret, and 2) a best-of-both-worlds algorithm that achieves O(log T) variance-dependent regret in the stochastic regime and Õ(√T) data-dependent regret in the adversarial regime.
Result: The algorithms achieve theoretical regret bounds: logarithmic in stochastic settings and sublinear in adversarial settings with data-dependent adaptation. Numerical experiments show they outperform existing CSB methods.
Conclusion: MP-CSB successfully extends combinatorial bandits to integer action spaces, providing efficient algorithms with strong theoretical guarantees and practical performance improvements over existing approaches.
Abstract: In the combinatorial semi-bandit (CSB) problem, a player selects an action from a combinatorial action set and observes feedback from the base arms included in the action. While CSB is widely applicable to combinatorial optimization problems, its restriction to binary decision spaces excludes important cases involving non-negative integer flows or allocations, such as the optimal transport and knapsack problems. To overcome this limitation, we propose the multi-play combinatorial semi-bandit (MP-CSB), where a player can select a non-negative integer action and observe multiple feedbacks from a single arm in each round. We propose two algorithms for the MP-CSB. One is a Thompson-sampling-based algorithm that is computationally feasible even when the action space is exponentially large with respect to the number of arms, and attains $O(\log T)$ distribution-dependent regret in the stochastic regime, where $T$ is the time horizon. The other is a best-of-both-worlds algorithm, which achieves $O(\log T)$ variance-dependent regret in the stochastic regime and the worst-case $\tilde{\mathcal{O}}\left( \sqrt{T} \right)$ regret in the adversarial regime. Moreover, its regret in the adversarial regime is data-dependent, adapting to the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss sequence. Finally, we numerically show that the proposed algorithms outperform existing methods in the CSB literature.
[233] SciML Agents: Write the Solver, Not the Solution
Saarth Gaonkar, Xiang Zheng, Haocheng Xi, Rishabh Tiwari, Kurt Keutzer, Dmitriy Morozov, Michael W. Mahoney, Amir Gholami
Main category: cs.LG
TL;DR: LLMs can generate scientifically appropriate code for solving ODEs by leveraging numerical algorithms instead of directly learning solution functions, achieving high accuracy with proper prompting and fine-tuning.
Details
Motivation: Traditional scientific machine learning approaches struggle with accuracy and robustness when predicting solutions directly. This work explores using LLMs to write domain-aware numerical code as an alternative approach.
Method: Introduces two datasets: diagnostic adversarial problems and 1,000 diverse ODE tasks. Evaluates LLMs with guided/unguided prompting and fine-tuning, measuring executability and numerical validity against reference solutions (see the sketch below).
Result: Newer instruction-following LLMs achieve high accuracy with sufficient context and guided prompts. Open-source systems perform well without fine-tuning, while older/smaller models benefit from fine-tuning.
Conclusion: Careful prompting and fine-tuning can create specialized LLM agents capable of reliably solving simple ODE problems by making appropriate numerical choices rather than learning solutions directly.
Abstract: Recent work in scientific machine learning aims to tackle scientific tasks directly by predicting target values with neural networks (e.g., physics-informed neural networks, neural ODEs, neural operators, etc.), but attaining high accuracy and robustness has been challenging. We explore an alternative view: use LLMs to write code that leverages decades of numerical algorithms. This shifts the burden from learning a solution function to making domain-aware numerical choices. We ask whether LLMs can act as SciML agents that, given a natural-language ODE description, generate runnable code that is scientifically appropriate, selecting suitable solvers (stiff vs. non-stiff), and enforcing stability checks. There is currently no benchmark to measure this kind of capability for scientific computing tasks. As such, we first introduce two new datasets: a diagnostic dataset of adversarial “misleading” problems; and a large-scale benchmark of 1,000 diverse ODE tasks. The diagnostic set contains problems whose superficial appearance suggests stiffness, and that require algebraic simplification to demonstrate non-stiffness; and the large-scale benchmark spans stiff and non-stiff ODE regimes. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Our evaluation measures both executability and numerical validity against reference solutions. We find that with sufficient context and guided prompts, newer instruction-following models achieve high accuracy on both criteria. In many cases, recent open-source systems perform strongly without fine-tuning, while older or smaller models still benefit from fine-tuning. Overall, our preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
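The kind of artifact the agents are judged on is ordinary solver code. A minimal sketch of a scientifically appropriate answer for a stiff problem (a hypothetical task; the benchmark poses its ODEs in natural language): pick an implicit method and check that the solve succeeded.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    # classic stiff test problem: y' = -1000 (y - cos t)
    return -1000.0 * (y - np.cos(t))

sol = solve_ivp(rhs, (0.0, 1.0), [0.0], method="Radau",  # stiff -> implicit
                rtol=1e-8, atol=1e-10)
assert sol.success                                       # stability check
print(sol.y[0, -1])
```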
[234] DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition
Yifei Wang, Wenbin Wang, Yong Luo
Main category: cs.LG
TL;DR: DyKen-Hyena introduces dynamic convolutional kernels from audio-visual cues to modulate textual feature extraction at token-level, achieving state-of-the-art performance on MIR benchmarks with significant improvement in out-of-scope detection.
Details
Motivation: Current multimodal intent recognition models risk corrupting linguistic features with noisy non-verbal signals through simple feature fusion, failing to capture the fine-grained token-level modulation by which non-verbal cues should influence textual meaning.
Method: The model reframes the problem from feature fusion to processing modulation by translating audio-visual cues into dynamic, per-token convolutional kernels that directly modulate textual feature extraction (see the sketch below).
Result: Achieves state-of-the-art results on MIntRec and MIntRec2.0 benchmarks, with +10.46% F1-score improvement in out-of-scope detection.
Conclusion: The method creates a fundamentally more robust intent representation by enabling fine-grained modulation of textual features using non-verbal cues, rather than simple feature fusion.
Abstract: Though Multimodal Intent Recognition (MIR) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential for intent-irrelevant and conflicting information across modalities may hinder performance from being further improved. Most current models attempt to fuse modalities by applying mechanisms like multi-head attention to unimodal feature sequences and then adding the result back to the original representation. This process risks corrupting the primary linguistic features with noisy or irrelevant non-verbal signals, as it often fails to capture the fine-grained, token-level influence where non-verbal cues should modulate, not just augment, textual meaning. To address this, we introduce DyKen-Hyena, which reframes the problem from feature fusion to processing modulation. Our model translates audio-visual cues into dynamic, per-token convolutional kernels that directly modulate textual feature extraction. This fine-grained approach achieves state-of-the-art results on the MIntRec and MIntRec2.0 benchmarks. Notably, it yields a +10.46% F1-score improvement in out-of-scope detection, validating that our method creates a fundamentally more robust intent representation.
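A minimal sketch of the per-token dynamic convolution at the heart of this idea, with the kernel generator abstracted away (DyKen-Hyena produces the kernels from audio-visual cues via cross-attention inside a Hyena-style operator, which is not reproduced here):

```python
import torch

def dynamic_token_conv(text, kernels):
    """text: (L, d) token features; kernels: (L, k), one kernel per token.
    Each token is re-extracted with its own kernel over a causal window,
    so non-verbal cues modulate, rather than augment, the text features."""
    L, d = text.shape
    k = kernels.shape[1]
    padded = torch.cat([torch.zeros(k - 1, d), text], dim=0)
    return torch.stack([kernels[i] @ padded[i:i + k] for i in range(L)])

out = dynamic_token_conv(torch.randn(10, 16), torch.randn(10, 3))  # (10, 16)
```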
[235] Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
Mingxuan Jiang, Yongxin Wang, Ziyue Dai, Yicun Liu, Hongyi Nie, Sen Liu, Hongfeng Chai
Main category: cs.LG
TL;DR: ReFine is a framework that uses symbolic rules from interpretable models and dual-granularity filtering to generate high-quality synthetic tabular data, outperforming existing methods in both regression and classification tasks.
Details
Motivation: Existing tabular data generation methods require sufficient reference data and often fail to capture domain-specific feature-label dependencies, leading to poor performance in data-scarce domains.
Method: ReFine derives symbolic “if-then” rules from interpretable models to guide generation, and applies dual-granularity filtering to suppress over-sampling and refine rare informative samples (see the sketch below).
Result: Achieves up to 0.44 absolute improvement in R-squared for regression and 10.0% relative improvement in F1 score for classification compared to state-of-the-art methods.
Conclusion: ReFine effectively addresses data scarcity issues in domain-specific databases and improves downstream task performance through rule-guided generation and intelligent filtering.
Abstract: Synthetic tabular data generation is increasingly essential in data management, supporting downstream applications when real-world and high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific databases with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often fail to capture dataset-specific feature-label dependencies and generate redundant data, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) derives symbolic “if-then” rules from interpretable models and embeds them into prompts to explicitly guide generation toward domain-specific feature distribution, and (ii) applies a dual-granularity filtering strategy that suppresses over-sampling patterns and selectively refines rare but informative samples to reduce distributional imbalance. Extensive experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods, achieving up to 0.44 absolute improvement in R-squared for regression and 10.0 percent relative improvement in F1 score for classification tasks.
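The rule-derivation step can be illustrated with a shallow decision tree as the interpretable model (an assumption on my part; ReFine's exact rule extractor and prompt template are not given in this summary). The tree's if-then paths become plain text that conditions the LLM's generation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3).fit(data.data, data.target)
rules = export_text(tree, feature_names=list(data.feature_names))

# Embed the symbolic rules into the generation prompt to steer the LLM
# toward the dataset's feature-label dependencies.
prompt = "Generate rows consistent with these rules:\n" + rules
print(prompt[:200])
```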
[236] Data-Driven Energy Estimation for Virtual Servers Using Combined System Metrics and Machine Learning
Amandip Sangha
Main category: cs.LG
TL;DR: Machine learning approach using guest VM resource metrics to estimate energy consumption without host access, achieving high accuracy (R² 0.90-0.97).
Details
Motivation: Addresses the inability to directly measure energy consumption in virtualized environments like cloud computing, where physical power measurement interfaces are inaccessible.
Method: Uses a Gradient Boosting Regressor trained on resource utilization metrics collected from guest virtual machines to predict energy consumption measured via RAPL on the host (see the sketch below).
Result: Achieved high predictive accuracy with variance explained between 0.90 and 0.97 across diverse workloads, demonstrating feasibility of guest-side energy estimation.
Conclusion: Proves guest-only resource-based energy estimation is feasible without privileged host access, enabling energy-aware scheduling and cost optimization in virtualized environments.
Abstract: This paper presents a machine learning-based approach to estimate the energy consumption of virtual servers without access to physical power measurement interfaces. Using resource utilization metrics collected from guest virtual machines, we train a Gradient Boosting Regressor to predict energy consumption measured via RAPL on the host. We demonstrate, for the first time, guest-only resource-based energy estimation without privileged host access with experiments across diverse workloads, achieving high predictive accuracy and variance explained ($0.90 \leq R^2 \leq 0.97$), indicating the feasibility of guest-side energy estimation. This approach can enable energy-aware scheduling, cost optimization and physical host independent energy estimates in virtualized environments. Our approach addresses a critical gap in virtualized environments (e.g. cloud) where direct energy measurement is infeasible.
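The modeling recipe is standard supervised regression, which makes it easy to sketch end to end with synthetic stand-in data (in the paper, the features are guest-side utilization metrics and the target is host-side RAPL power):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 4))       # e.g. cpu, memory, disk, network load
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=0.5, size=5000)  # watts

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)))   # paper reports 0.90-0.97
```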
[237] Neural Scaling Laws for Deep Regression
Tilen Cadez, Kyoung-Min Kim
Main category: cs.LG
TL;DR: Empirical investigation of neural scaling laws in deep regression models for parameter estimation in twisted van der Waals magnets, showing power-law relationships between loss and dataset size/model capacity with exponents ranging from 1 to 2.
Details
Motivation: Neural scaling laws are crucial for developing reliable models with limited resources, but their application to deep regression models remains largely unexplored despite their importance in large language models.
Method: Used various neural network architectures (fully connected networks, residual networks, vision transformers) to study parameter estimation models for twisted van der Waals magnets across wide ranges of training dataset sizes and model capacities (see the sketch below).
Result: Observed consistent power-law relationships between loss and both training dataset size and model capacity, with scaling exponents ranging from 1 to 2 depending on the regressed parameters and model details.
Conclusion: The large scaling exponents (1-2) suggest that deep regression models can achieve substantial performance improvements with increasing data size, demonstrating consistent scaling behaviors similar to those observed in other deep learning domains.
Abstract: Neural scaling laws–power-law relationships between generalization errors and characteristics of deep learning models–are vital tools for developing reliable models while managing limited resources. Although the success of large language models highlights the importance of these laws, their application to deep regression models remains largely unexplored. Here, we empirically investigate neural scaling laws in deep regression using a parameter estimation model for twisted van der Waals magnets. We observe power-law relationships between the loss and both training dataset size and model capacity across a wide range of values, employing various architectures–including fully connected networks, residual networks, and vision transformers. Furthermore, the scaling exponents governing these relationships range from 1 to 2, with specific values depending on the regressed parameters and model details. The consistent scaling behaviors and their large scaling exponents suggest that the performance of deep regression models can improve substantially with increasing data size.
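Reading off a scaling exponent is a log-log linear fit: a power law $L(N) = a N^{-\alpha}$ becomes a straight line with slope $-\alpha$. A minimal sketch on synthetic data with $\alpha = 1.5$:

```python
import numpy as np

N = np.array([1e3, 3e3, 1e4, 3e4, 1e5])                  # dataset sizes
noise = np.random.default_rng(0).normal(0.0, 0.02, N.size)
loss = 50.0 * N ** -1.5 * np.exp(noise)                   # synthetic losses

slope, _ = np.polyfit(np.log(N), np.log(loss), 1)
print(f"estimated exponent: {-slope:.2f}")                # close to 1.5
```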
[238] Intrinsic Dimension Estimating Autoencoder (IDEA) Using CancelOut Layer and a Projected Loss
Antoine Orioua, Philipp Krah, Julian Koellermeier
Main category: cs.LG
TL;DR: IDEA is an autoencoder that estimates intrinsic dimension and reconstructs datasets on linear/nonlinear manifolds using re-weighted double CancelOut layers and a novel projected reconstruction loss.
Details
Motivation: To develop a method that can accurately estimate the intrinsic dimension of datasets while also reconstructing the original data from the identified latent space, particularly for complex datasets like fluid flow simulations.
Method: Uses an autoencoder architecture with re-weighted double CancelOut layers and introduces a projected reconstruction loss that continuously evaluates reconstruction quality when removing latent dimensions.
Result: IDEA shows good accuracy and high versatility on theoretical benchmarks, successfully estimating intrinsic dimensions and reconstructing complex fluid flow simulation data.
Conclusion: IDEA provides an effective approach for both intrinsic dimension estimation and data reconstruction, demonstrating robustness across various dataset types including complex scientific simulations.
Abstract: This paper introduces the Intrinsic Dimension Estimating Autoencoder (IDEA), which identifies the underlying intrinsic dimension of a wide range of datasets whose samples lie on either linear or nonlinear manifolds. Beyond estimating the intrinsic dimension, IDEA is also able to reconstruct the original dataset after projecting it onto the corresponding latent space, which is structured using re-weighted double CancelOut layers. Our key contribution is the introduction of the projected reconstruction loss term, guiding the training of the model by continuously assessing the reconstruction quality under the removal of an additional latent dimension. We first assess the performance of IDEA on a series of theoretical benchmarks to validate its robustness. These experiments allow us to test its reconstruction ability and compare its performance with state-of-the-art intrinsic dimension estimators. The benchmarks show good accuracy and high versatility of our approach. Subsequently, we apply our model to data generated from the numerical solution of a vertically resolved one-dimensional free-surface flow, following a pointwise discretization of the vertical velocity profile in the horizontal direction, vertical direction, and time. IDEA succeeds in estimating the dataset’s intrinsic dimension and then reconstructs the original solution by working directly within the projection space identified by the network.
[239] Exploring Expert Specialization through Unsupervised Training in Sparse Mixture of Experts
Strahinja Nikolic, Ilker Oguz, Demetri Psaltis
Main category: cs.LG
TL;DR: SMoE-VAE architecture with unsupervised expert routing outperforms supervised baseline on QuickDraw dataset, discovering meaningful sub-categorical structures beyond human-defined class boundaries.
Details
Motivation: To understand the internal organization of neural networks and explore how mixture-of-experts models can uncover fundamental data structures that may be more aligned with model objectives than predefined labels.
Method: Developed a Sparse Mixture of Experts Variational Autoencoder (SMoE-VAE) and tested it on the QuickDraw dataset, comparing unsupervised expert routing against a supervised baseline guided by ground-truth labels. Used t-SNE visualizations and reconstruction analysis.
Result: Unsupervised routing consistently achieved superior reconstruction performance. Experts learned meaningful sub-categorical structures that often transcend human-defined class boundaries. Study on dataset size impact revealed trade-offs between data quantity and expert specialization.
Conclusion: Unsupervised expert routing in MoE models can discover more fundamental data structures than predefined labels, providing guidance for designing efficient MoE architectures that better align with model objectives.
Abstract: Understanding the internal organization of neural networks remains a fundamental challenge in deep learning interpretability. We address this challenge by exploring a novel Sparse Mixture of Experts Variational Autoencoder (SMoE-VAE) architecture. We test our model on the QuickDraw dataset, comparing unsupervised expert routing against a supervised baseline guided by ground-truth labels. Surprisingly, we find that unsupervised routing consistently achieves superior reconstruction performance. The experts learn to identify meaningful sub-categorical structures that often transcend human-defined class boundaries. Through t-SNE visualizations and reconstruction analysis, we investigate how MoE models uncover fundamental data structures that are more aligned with the model’s objective than predefined labels. Furthermore, our study on the impact of dataset size provides insights into the trade-offs between data quantity and expert specialization, offering guidance for designing efficient MoE architectures.
[240] Sparse Coding Representation of 2-way Data
Boya Ma, Abram Magner, Maxwell McNeil, Petko Bogdanov
Main category: cs.LG
TL;DR: AODL is a low-rank coding model for sparse dictionary learning that reduces sample complexity and produces sparser solutions compared to traditional methods, while maintaining reconstruction quality and providing interpretable patterns.
Details
Motivation: Sparse dictionary coding faces challenges in multi-dictionary scenarios, where encoding coefficients correspond to all atom combinations and both dictionaries and coding coefficients must be learned, which becomes computationally expensive and data-intensive.
Method: Proposed a low-rank coding model for 2-dictionary scenarios with a convex relaxation solution (AODL), using alternating optimization between sparse coding matrices and learned dictionaries with proven convergence.
Result: AODL learns up to 90% sparser solutions compared to non-low-rank and analytical dictionary baselines while maintaining fixed reconstruction quality, and reveals interpretable insights into training data patterns.
Conclusion: The low-rank coding approach effectively addresses the data complexity challenge in multi-dictionary learning, providing sparser representations with good generalization and interpretable results for data reconstruction and missing value imputation.
Abstract: Sparse dictionary coding represents signals as linear combinations of a few dictionary atoms. It has been applied to images, time series, graph signals and multi-way spatio-temporal data by jointly employing temporal and spatial dictionaries. Data-agnostic analytical dictionaries, such as the discrete Fourier transform, wavelets and graph Fourier, have seen wide adoption due to efficient implementations and good practical performance. On the other hand, dictionaries learned from data offer sparser and more accurate solutions but require learning of both the dictionaries and the coding coefficients. This becomes especially challenging for multi-dictionary scenarios since encoding coefficients correspond to all atom combinations from the dictionaries. To address this challenge, we propose a low-rank coding model for 2-dictionary scenarios and study its data complexity. Namely, we establish a bound on the number of samples needed to learn dictionaries that generalize to unseen samples from the same distribution. We propose a convex relaxation solution, called AODL, whose exact solution we show also solves the original problem. We then solve this relaxation via alternating optimization between the sparse coding matrices and the learned dictionaries, which we prove to be convergent. We demonstrate its quality for data reconstruction and missing value imputation in both synthetic and real-world datasets. For a fixed reconstruction quality, AODL learns up to 90% sparser solutions compared to non-low-rank and analytical (fixed) dictionary baselines. In addition, the learned dictionaries reveal interpretable insights into patterns present within the samples used for training.
[241] Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu
Main category: cs.LG
TL;DR: FSPO is a sequence-level RL method that addresses length bias in LLM training by introducing length-fair clipping in importance-sampling weight space, ensuring fair treatment of short vs long responses.
Details
Motivation: Existing sequence-level RL methods like PPO/GRPO exhibit a length bias in which fixed clip ranges systematically favor shorter or longer responses, distorting the training objective and creating unfair treatment based on response length.
Method: FSPO introduces a Gaussian-motivated clipping approach that applies a KL-corrected drift term and scales the clip range as √L (square root of length), clipping sequence log-IS ratios with a length-adaptive band to ensure fairness (see the sketch below).
Result: Empirical results show FSPO flattens clip rates across different length bins, stabilizes training dynamics, and outperforms all baseline methods across multiple evaluation datasets.
Conclusion: FSPO successfully addresses length fairness in sequence-level RL by formalizing length reweighting error and providing a theoretically-grounded solution that ensures directional consistency between clipped and true updates while improving performance.
Abstract: We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
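A minimal sketch of the length-fair clipping rule as described (my reading; FSPO's exact KL drift correction may differ): clip the sequence-level log importance ratio within a band whose width grows as the square root of the response length.

```python
import torch

def fspo_clip_log_ratio(logp_new, logp_old, c=0.2, kl_drift=0.0):
    """logp_new, logp_old: (L,) per-token log-probs of one response."""
    L = logp_new.shape[0]
    log_ratio = (logp_new - logp_old).sum() - kl_drift  # sequence log-IS ratio
    band = c * (float(L) ** 0.5)                        # width scales as sqrt(L)
    return torch.clamp(log_ratio, min=-band, max=band)
```

With a fixed band, long responses (whose log ratios naturally have larger variance) get clipped far more often than short ones; scaling the band by sqrt(L) is what flattens clip rates across length bins.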
[242] Symbolic Feedforward Networks for Probabilistic Finite Automata: Exact Simulation and Learnability
Sahil Rajesh Dhayalkar
Main category: cs.LG
TL;DR: Probabilistic finite automata can be exactly simulated using symbolic feedforward neural networks that represent state distributions as vectors and transitions as matrices, enabling parallel and differentiable simulation without recurrence.
Details
Motivation: To bridge the gap between symbolic computation (probabilistic automata theory) and deep learning by developing a unified algebraic framework that connects the two domains.
Method: Symbolic feedforward neural networks represent state distributions as vectors and transitions as stochastic matrices, enabling probabilistic state propagation through matrix-vector products (see the sketch below). The approach includes probabilistic subset construction, ε-closure, and exact simulation via layered symbolic computation.
Result: The neural networks can exactly simulate PFAs and are learnable - when trained with gradient descent on labeled sequence data, they recover the exact behavior of ground-truth PFAs. Formal equivalence is proven between PFAs and specific neural network classes.
Conclusion: This work successfully unifies probabilistic automata theory with neural architectures, demonstrating that symbolic neural networks can exactly simulate and learn PFAs, bridging symbolic computation and deep learning under a rigorous algebraic framework.
Abstract: We present a formal and constructive theory showing that probabilistic finite automata (PFAs) can be exactly simulated using symbolic feedforward neural networks. Our architecture represents state distributions as vectors and transitions as stochastic matrices, enabling probabilistic state propagation via matrix-vector products. This yields a parallel, interpretable, and differentiable simulation of PFA dynamics using soft updates-without recurrence. We formally characterize probabilistic subset construction, $\varepsilon$-closure, and exact simulation via layered symbolic computation, and prove equivalence between PFAs and specific classes of neural networks. We further show that these symbolic simulators are not only expressive but learnable: trained with standard gradient descent-based optimization on labeled sequence data, they recover the exact behavior of ground-truth PFAs. This learnability, formalized in Proposition 5.1, is the crux of this work. Our results unify probabilistic automata theory with neural architectures under a rigorous algebraic framework, bridging the gap between symbolic computation and deep learning.
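The simulation primitive is literally a chain of matrix-vector products: the state distribution is a stochastic vector, and each input symbol selects a row-stochastic transition matrix, i.e. one feedforward "layer" per symbol. A toy two-state example:

```python
import numpy as np

T = {  # toy 2-state PFA over the alphabet {a, b}
    "a": np.array([[0.9, 0.1], [0.4, 0.6]]),
    "b": np.array([[0.2, 0.8], [0.7, 0.3]]),
}
accept = np.array([0.0, 1.0])          # indicator of accepting states

def acceptance_probability(string, start=np.array([1.0, 0.0])):
    dist = start
    for symbol in string:              # one symbolic layer per input symbol
        dist = dist @ T[symbol]
    return float(dist @ accept)

print(acceptance_probability("abba"))
```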
[243] AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings
Om Vishesh, Harshad Khadilkar, Deepak Akkil
Main category: cs.LG
TL;DR: A fully automated AI system called ‘Agent-E’ that identifies academic papers from specific geographic regions and uses robotic process automation to complete predefined actions like form submissions, achieving 100% recall and 99.4% accuracy.
Details
Motivation: Addressing the challenge of keeping up with rapidly growing academic literature and reducing the time-consuming manual effort required for scholarly discovery and administrative tasks.
Method: A specialized AI agent pipeline that transitions from data discovery to direct action, using robotic process automation (RPA) to complete predefined actions after identifying target papers from conference proceedings.
Result: Validated on 586 papers from five conferences, achieving perfect recall (100%) and near-perfect accuracy (99.4%) in identifying target papers and executing automated actions.
Conclusion: Task-oriented AI agents can not only filter information but also actively participate in and accelerate academic workflows, demonstrating significant potential for automating scholarly processes.
Abstract: Keeping pace with the rapid growth of academic literature presents a significant challenge for researchers, funding bodies, and academic societies. To address the time-consuming manual effort required for scholarly discovery, we present a novel, fully automated system that transitions from data discovery to direct action. Our pipeline demonstrates how a specialized AI agent, ‘Agent-E’, can be tasked with identifying papers from specific geographic regions within conference proceedings and then executing a Robotic Process Automation (RPA) to complete a predefined action, such as submitting a nomination form. We validated our system on 586 papers from five different conferences, where it successfully identified every target paper with a recall of 100% and a near-perfect accuracy of 99.4%. This demonstration highlights the potential of task-oriented AI agents to not only filter information but also to actively participate in and accelerate the workflows of the academic community.
[244] FedRP: A Communication-Efficient Approach for Differentially Private Federated Learning Using Random Projection
Mohammad Hasan Narimani, Mostafa Tavassolipour
Main category: cs.LG
TL;DR: FedRP is a novel federated learning algorithm that combines random projection with ADMM optimization to enhance privacy protection and reduce communication costs while maintaining high model accuracy.
Details
Motivation: Federated learning faces challenges in protecting user privacy against attacks and in high communication costs, especially in sensitive domains like IoT and medical data analysis.
Method: Integrates random projection techniques with the ADMM optimization framework to reduce the dimensionality of model parameters before transmission, providing strong differential privacy guarantees (see the sketch below).
Result: FedRP maintains high model accuracy while outperforming existing methods (including conventional differential privacy and FedADMM) in both privacy preservation and communication efficiency.
Conclusion: The proposed FedRP algorithm successfully addresses key FL challenges by providing robust privacy protection with reduced communication overhead, making it suitable for privacy-sensitive applications.
Abstract: Federated learning (FL) offers an innovative paradigm for collaborative model training across decentralized devices, such as smartphones, balancing enhanced predictive performance with the protection of user privacy in sensitive areas like Internet of Things (IoT) and medical data analysis. Despite its advantages, FL encounters significant challenges related to user privacy protection against potential attacks and the management of communication costs. This paper introduces a novel federated learning algorithm called FedRP, which integrates random projection techniques with the Alternating Direction Method of Multipliers (ADMM) optimization framework. This approach enhances privacy by employing random projection to reduce the dimensionality of model parameters prior to their transmission to a central server, reducing the communication cost. The proposed algorithm offers a strong $(\epsilon, \delta)$-differential privacy guarantee, demonstrating resilience against data reconstruction attacks. Experimental results reveal that FedRP not only maintains high model accuracy but also outperforms existing methods, including conventional differential privacy approaches and FedADMM, in terms of both privacy preservation and communication efficiency.
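The communication-saving step is the projection itself, which is easy to sketch (an assumption on my part: a Gaussian sketching matrix reproducible from a shared seed; FedRP additionally wires this into ADMM updates and a formal $(\epsilon, \delta)$-DP analysis, omitted here):

```python
import numpy as np

def project_update(params, k, seed):
    """Compress a d-dimensional parameter vector to k dimensions before
    sending it to the server; the same seed regenerates P on both sides."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(k, params.size)) / np.sqrt(k)  # k << d projection
    return P @ params

update = project_update(np.random.randn(10_000), k=256, seed=42)
print(update.shape)   # (256,) transmitted instead of (10000,)
```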
[245] Uncertainty-Aware Tabular Prediction: Evaluating VBLL-Enhanced TabPFN in Safety-Critical Medical Data
Madhushan Ramalingam
Main category: cs.LG
TL;DR: VBLL integration with TabPFN underperforms original TabPFN in uncertainty calibration across medical datasets
Details
Motivation: Reliable uncertainty estimation is crucial for safety-critical applications like medical diagnosis, and TabPFN is a promising foundation model for tabular data that could benefit from improved uncertainty calibration.
Method: Integrated Variational Bayesian Last Layers (VBLL) with TabPFN and compared uncertainty calibration performance against original TabPFN on three benchmark medical tabular datasets.
Result: Original TabPFN consistently outperformed VBLL-integrated TabPFN in uncertainty calibration across all datasets, contrary to expectations
Conclusion: VBLL integration does not improve uncertainty calibration for TabPFN, suggesting the original model already provides strong uncertainty estimation capabilities
Abstract: Predictive models are being increasingly used across a wide range of domains, including safety-critical applications such as medical diagnosis and criminal justice. Reliable uncertainty estimation is a crucial task in such settings. Tabular Prior-data Fitted Network (TabPFN) is a recently proposed machine learning foundation model for tabular datasets, which uses a generative transformer architecture. Variational Bayesian Last Layers (VBLL) is a state-of-the-art lightweight variational formulation that effectively improves uncertainty estimation with minimal computational overhead. In this work, we aim to evaluate the performance of VBLL integrated with the recently proposed TabPFN in uncertainty calibration. Our experiments, conducted on three benchmark medical tabular datasets, compare the performance of the original TabPFN and the VBLL-integrated version. Contrary to expectations, we observed that the original TabPFN consistently outperforms the VBLL-integrated TabPFN in uncertainty calibration across all datasets.
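For readers who want to reproduce this kind of comparison, below is a small sketch of one standard calibration metric, expected calibration error; the paper may rely on different calibration measures, so treat this as a generic illustration rather than the authors' protocol.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: average gap between predicted confidence and empirical accuracy.

    conf: predicted confidence of the chosen class; correct: 0/1 outcomes.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            ece += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = (rng.uniform(size=1000) < conf).astype(float)  # well-calibrated toy model
print(expected_calibration_error(conf, correct))         # close to 0
```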
[246] KAN-SR: A Kolmogorov-Arnold Network Guided Symbolic Regression Framework
Marco Andrea Bühler, Gonzalo Guillén-Gosálbez
Main category: cs.LG
TL;DR: KAN-SR is a novel symbolic regression framework using Kolmogorov Arnold Networks with deep learning and simplification strategies to accurately recover mathematical equations from data and model dynamic systems.
Details
Motivation: Traditional symbolic regression uses genetic programming approaches, but deep learning techniques combined with KANs can provide more accurate and efficient equation discovery, particularly for scientific discovery and engineering system modeling.
Method: Uses Kolmogorov Arnold Networks (KANs) with divide-and-conquer approach, deep learning techniques, and simplification strategies including translational symmetries and separabilities. Combines with neural controlled differential equations for dynamic system modeling.
Result: Successfully recovers ground-truth equations from the Feynman SRSD dataset and precisely models the dynamics of an in-silico bioprocess system.
Conclusion: KAN-SR framework demonstrates superior performance in symbolic regression and opens doors for dynamic modeling of various engineering systems using neural controlled differential equations.
Abstract: We introduce a novel symbolic regression framework, namely KAN-SR, built on Kolmogorov Arnold Networks (KANs) which follows a divide-and-conquer approach. Symbolic regression searches for mathematical equations that best fit a given dataset and is commonly solved with genetic programming approaches. We show that by using deep learning techniques, more specifically KANs, and combining them with simplification strategies such as translational symmetries and separabilities, we are able to recover ground-truth equations of the Feynman Symbolic Regression for Scientific Discovery (SRSD) dataset. Additionally, we show that by combining the proposed framework with neural controlled differential equations, we are able to model the dynamics of an in-silico bioprocess system precisely, opening the door for the dynamic modeling of other engineering systems.
[247] Cost-Free Personalization via Information-Geometric Projection in Bayesian Federated Learning
Nour Jamoussi, Giuseppe Serra, Photios A. Stavrou, Marios Kountouris
Main category: cs.LG
TL;DR: Proposes an information-geometric projection framework for personalized Bayesian Federated Learning that achieves tunable trade-off between global generalization and local specialization with minimal computational overhead.
Details
Motivation: Bayesian Federated Learning needs better personalization mechanisms to handle data heterogeneity while maintaining privacy constraints, as existing MCMC and variational inference approaches often lack efficient personalization.
Method: Information-geometric projection framework that projects global model onto neighborhood of user’s local model, equivalent to computing barycenter on statistical manifold, using IVON optimizer and extending to general BFL aggregation schemes.
Result: Empirical evaluations under heterogeneous data distributions show effective balance between global and local performance with minimal computational overhead.
Conclusion: The proposed framework enables cost-free personalization with closed-form solutions, providing a tunable trade-off between global generalization and local specialization in Bayesian Federated Learning.
Abstract: Bayesian Federated Learning (BFL) combines uncertainty modeling with decentralized training, enabling the development of personalized and reliable models under data heterogeneity and privacy constraints. Existing approaches typically rely on Markov Chain Monte Carlo (MCMC) sampling or variational inference, often incorporating personalization mechanisms to better adapt to local data distributions. In this work, we propose an information-geometric projection framework for personalization in parametric BFL. By projecting the global model onto a neighborhood of the user’s local model, our method enables a tunable trade-off between global generalization and local specialization. Under mild assumptions, we show that this projection step is equivalent to computing a barycenter on the statistical manifold, allowing us to derive closed-form solutions and achieve cost-free personalization. We apply the proposed approach to a variational learning setup using the Improved Variational Online Newton (IVON) optimizer and extend its application to general aggregation schemes in BFL. Empirical evaluations under heterogeneous data distributions confirm that our method effectively balances global and local performance with minimal computational overhead.
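A minimal sketch of what a closed-form, barycenter-style personalization step could look like for diagonal-Gaussian posteriors: interpolate natural parameters (precisions and precision-weighted means) between the global and local models. This is one way to realize a barycenter on the Gaussian manifold; the paper's exact geometry may differ, and `lam` is the tunable trade-off knob.

```python
import numpy as np

def personalize(mu_g, var_g, mu_l, var_l, lam):
    """Interpolate two diagonal Gaussians in natural-parameter space.

    lam in [0, 1]: 0 -> pure global model, 1 -> pure local model.
    """
    prec = (1 - lam) / var_g + lam / var_l
    mu = ((1 - lam) * mu_g / var_g + lam * mu_l / var_l) / prec
    return mu, 1.0 / prec

# Toy global vs. local posterior over 4 weights.
mu, var = personalize(np.zeros(4), np.ones(4),
                      np.ones(4), 0.5 * np.ones(4), lam=0.3)
print(mu, var)   # lies between the global and local posteriors, in closed form
```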
[248] BenchECG and xECG: a benchmark and baseline for ECG foundation models
Riccardo Lunelli, Angus Nicolson, Samuel Martin Pröll, Sebastian Johannes Reinstadler, Axel Bauer, Clemens Dlaska
Main category: cs.LG
TL;DR: BenchECG is a standardized benchmark for ECG foundation models, and xECG (an xLSTM-based model with SimDINOv2 self-supervised learning) achieves state-of-the-art performance across all datasets and tasks.
Details
Motivation: Previous ECG foundation model evaluations lacked consistency, using narrow task selections and inconsistent datasets, which hindered fair comparison between different approaches.
Method: Proposed BenchECG benchmark with comprehensive public ECG datasets and versatile tasks. Developed xECG using xLSTM-based recurrent model trained with SimDINOv2 self-supervised learning.
Result: xECG achieved the best BenchECG score compared to publicly available state-of-the-art models, and was the only publicly available model to perform strongly on all datasets and tasks.
Conclusion: BenchECG enables rigorous comparison and accelerates progress in ECG representation learning, while xECG sets a new baseline for future ECG foundation models with superior performance over earlier approaches.
Abstract: Electrocardiograms (ECGs) are inexpensive, widely used, and well-suited to deep learning. Recently, interest has grown in developing foundation models for ECGs - models that generalise across diverse downstream tasks. However, consistent evaluation has been lacking: prior work often uses narrow task selections and inconsistent datasets, hindering fair comparison. Here, we introduce BenchECG, a standardised benchmark comprising a comprehensive suite of publicly available ECG datasets and versatile tasks. We also propose xECG, an xLSTM-based recurrent model trained with SimDINOv2 self-supervised learning, which achieves the best BenchECG score compared to publicly available state-of-the-art models. In particular, xECG is the only publicly available model to perform strongly on all datasets and tasks. By standardising evaluation, BenchECG enables rigorous comparison and aims to accelerate progress in ECG representation learning. xECG achieves superior performance over earlier approaches, defining a new baseline for future ECG foundation models.
[249] FedBiF: Communication-Efficient Federated Learning via Bits Freezing
Shiwei Li, Qunwei Li, Haozhao Wang, Ruixuan Li, Jianbin Lin, Wenliang Zhong
Main category: cs.LG
TL;DR: FedBiF is a federated learning framework that learns quantized parameters during local training by freezing most bits and updating only one bit per parameter, achieving high compression (1bpp uplink/3bpp downlink) while maintaining accuracy comparable to FedAvg.
Details
Motivation: Federated learning suffers from high communication overhead, and existing quantization methods applied after training introduce errors that degrade model accuracy.
Method: Server quantizes model parameters and transmits to clients; each client updates only one bit of multi-bit parameter representation while freezing remaining bits during local training.
Result: Achieves superior communication compression and model sparsity while maintaining accuracy comparable to FedAvg across 5 datasets under both IID and Non-IID settings.
Conclusion: FedBiF effectively reduces communication costs through bit-by-bit updates during training, enabling efficient federated learning without sacrificing model performance.
Abstract: Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this paper, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication. The code is available at https://github.com/Leopold1423/fedbif-tpds25.
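To illustrate the bit-freezing update, the toy numpy sketch below overwrites only one bit position of each quantized parameter while leaving the other bits frozen; the gradient-driven choice of the new bit values is replaced here by a random stand-in, and the 4-bit width is an illustrative assumption.

```python
import numpy as np

n_bits = 4
rng = np.random.default_rng(0)
q = rng.integers(0, 2**n_bits, size=8)   # server-quantized parameters (4-bit ints)
j = 2                                     # the one trainable bit position this round

# Local "training" decides each parameter's bit j (random stand-in for the
# gradient-driven choice in the paper); all other bits stay frozen.
new_bits = rng.integers(0, 2, size=q.shape)
q_updated = (q & ~(1 << j)) | (new_bits << j)

# Uplink payload is just `new_bits`: one bit per parameter, as in the 1 bpp setting.
print(q)
print(q_updated)
print(new_bits)
```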
[250] Federated Multi-Agent Reinforcement Learning for Privacy-Preserving and Energy-Aware Resource Management in 6G Edge Networks
Francisco Javier Esono Nkulu Andong, Qi Min
Main category: cs.LG
TL;DR: Fed-MARL framework for 6G networks combining federated learning and multi-agent reinforcement learning for privacy-preserving, energy-efficient resource management across heterogeneous edge devices.
Details
Motivation: 6G networks require efficient resource management under strict privacy, mobility, and energy constraints in ultra-dense intelligent edge environments.
Method: Federated Multi-Agent Reinforcement Learning with Deep Recurrent Q-Networks for decentralized policies, a secure aggregation protocol using elliptic-curve Diffie-Hellman, and a POMMDP formulation with a multi-objective reward function.
Result: Outperforms centralized MARL and heuristic baselines in task success rate, latency, energy efficiency, and fairness while ensuring robust privacy protection and scalability.
Conclusion: Fed-MARL provides an effective solution for real-time resource management in dynamic 6G edge networks with strong privacy guarantees and multi-objective optimization.
Abstract: As sixth-generation (6G) networks move toward ultra-dense, intelligent edge environments, efficient resource management under stringent privacy, mobility, and energy constraints becomes critical. This paper introduces a novel Federated Multi-Agent Reinforcement Learning (Fed-MARL) framework that incorporates cross-layer orchestration of both the MAC layer and application layer for energy-efficient, privacy-preserving, and real-time resource management across heterogeneous edge devices. Each agent uses a Deep Recurrent Q-Network (DRQN) to learn decentralized policies for task offloading, spectrum access, and CPU energy adaptation based on local observations (e.g., queue length, energy, CPU usage, and mobility). To protect privacy, we introduce a secure aggregation protocol based on elliptic curve Diffie Hellman key exchange, which ensures accurate model updates without exposing raw data to semi-honest adversaries. We formulate the resource management problem as a partially observable multi-agent Markov decision process (POMMDP) with a multi-objective reward function that jointly optimizes latency, energy efficiency, spectral efficiency, fairness, and reliability under 6G-specific service requirements such as URLLC, eMBB, and mMTC. Simulation results demonstrate that Fed-MARL outperforms centralized MARL and heuristic baselines in task success rate, latency, energy efficiency, and fairness, while ensuring robust privacy protection and scalability in dynamic, resource-constrained 6G edge networks.
[251] A Symmetry-Integrated Approach to Surface Code Decoding
Hoshitaro Ohnishi, Hideo Mukai
Main category: cs.LG
TL;DR: Proposes a neural network-based reoptimization technique for surface code quantum error correction that improves decoder accuracy by treating syndrome measurements as a continuous regression problem rather than discrete classification.
Details
Motivation: Previous surface code decoders suffer from non-uniqueness of correct predictions, acquiring only error probability distributions rather than definitive corrections due to the discrete nature of syndrome measurements.
Method: Approximates syndrome measurements with a continuous function mathematically interpolated by neural networks, re-framing decoding as a regression problem. Evaluated a multilayer perceptron decoder for code distances 5 and 7, and convolutional, recurrent, and transformer decoders for code distance 5.
Result: Reoptimized decoders showed improved accuracy across all tested architectures and code distances (5 and 7), demonstrating universal effectiveness independent of network architecture or code distance.
Conclusion: Reframing surface code decoding as a regression problem solvable by deep learning is an effective strategy that universally improves decoder performance across different neural network architectures and code distances.
Abstract: Quantum error correction, which utilizes logical qubits that are encoded as redundant multiple physical qubits to find and correct errors in physical qubits, is indispensable for practical quantum computing. Surface code is considered to be a promising encoding method with a high error threshold that is defined by stabilizer generators. However, previous methods have suffered from the problem that the decoder acquires solely the error probability distribution because of the non-uniqueness of correct prediction obtained from the input. To circumvent this problem, we propose a technique to reoptimize the decoder model by approximating syndrome measurements with a continuous function that is mathematically interpolated by neural network. We evaluated the improvement in accuracy of a multilayer perceptron based decoder for code distances of 5 and 7 as well as for decoders based on convolutional and recurrent neural networks and transformers for a code distance of 5. In all cases, the reoptimized decoder gave better accuracy than the original models, demonstrating the universal effectiveness of the proposed method that is independent of code distance or network architecture. These results suggest that re-framing the problem of surface code decoding into a regression problem that can be tackled by deep learning is a useful strategy.
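A rough sketch of the regression re-framing: a small PyTorch MLP maps syndrome vectors to continuous per-qubit targets and is trained with MSE rather than a discrete classification loss. The sizes (24 stabilizers, 25 data qubits, as for a rotated distance-5 code) and the random targets are toy placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

n_syn, n_qubits = 24, 25  # hypothetical distance-5 sizes
model = nn.Sequential(nn.Linear(n_syn, 128), nn.ReLU(), nn.Linear(128, n_qubits))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

syndromes = torch.randint(0, 2, (512, n_syn)).float()  # toy syndrome data
targets = torch.rand(512, n_qubits)                    # continuous error "probabilities"

for _ in range(100):
    opt.zero_grad()
    # Regression objective: fit a continuous function through the discrete
    # syndrome measurements instead of classifying corrections.
    loss = nn.functional.mse_loss(model(syndromes), targets)
    loss.backward()
    opt.step()
```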
[252] The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
Lénaïc Chizat
Main category: cs.LG
TL;DR: The paper analyzes gradient-based training of deep ResNets, showing convergence to Neural Mean ODE dynamics with error bounds that depend on depth, width, and residual scaling. It identifies regimes for complete feature learning vs lazy training.
Details
Motivation: To understand the training dynamics of large-depth residual networks from random initialization, particularly how different scaling regimes affect feature learning capabilities and convergence behavior.
Method: Mathematical analysis of ResNet training dynamics using stochastic approximation and propagation of chaos techniques. Studies different residual scaling regimes, e.g. Θ(α/LM), and their impact on convergence to Neural Mean ODEs.
Result: Established error bounds O(1/L + α/√LM) between ResNet output and limit dynamics. Identified Θ(√D/LM) scaling as optimal for complete feature learning in two-layer perceptron ResNets. Showed α→∞ leads to lazy training regime.
Conclusion: Deep ResNets converge to Neural Mean ODE dynamics with quantifiable error bounds. The residual scaling significantly impacts feature learning capabilities, with specific scaling (Θ(√D/LM)) enabling complete non-linear feature learning while other scalings lead to lazy linear regimes.
Abstract: We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$, the training dynamics converges to a Neural Mean ODE training dynamics. Remarkably, the limit is independent of the scaling of $M$, covering practical cases of, say, Transformers, where $M$ (the number of hidden units or attention heads per layer) is typically of the order of $D$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{\alpha}{\sqrt{LM}}\big)$ between the model’s output and its limit after a fixed number of gradient steps, and we verify empirically that this rate is tight. When $\alpha=\Theta(1)$, the limit exhibits complete feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha \to \infty$ yields a “lazy” ODE regime where the Mean ODE is linearly parameterized. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We show that for this model, the only residual scale that leads to complete feature learning is $\Theta\big(\frac{\sqrt{D}}{LM}\big)$. In this regime, we prove the error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps, which is also empirically tight. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics.
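The feature-learning scale is easy to wire into code. Below is a minimal sketch of an $L$-block two-layer-perceptron ResNet with the explicit residual scaling $\sqrt{D}/(LM)$ that the paper identifies as the complete feature-learning regime; the concrete sizes are toy choices.

```python
import torch
import torch.nn as nn

class ScaledResBlock(nn.Module):
    """Two-layer perceptron residual block with an explicit residual scale."""
    def __init__(self, D, M, scale):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(D, M), nn.Linear(M, D)
        self.scale = scale

    def forward(self, x):
        return x + self.scale * self.fc2(torch.relu(self.fc1(x)))

D, M, L = 16, 64, 32
scale = D**0.5 / (L * M)   # the Theta(sqrt(D)/(LM)) feature-learning regime
net = nn.Sequential(*[ScaledResBlock(D, M, scale) for _ in range(L)])
print(net(torch.randn(2, D)).shape)   # (2, 16)
```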
[253] P3D: Scalable Neural Surrogates for High-Resolution 3D Physics Simulations with Global Context
Benjamin Holzschuh, Georg Kohl, Florian Redinger, Nils Thuerey
Main category: cs.LG
TL;DR: A scalable framework for learning neural surrogates for high-resolution 3D physics simulations using a hybrid CNN-Transformer architecture that outperforms existing methods in speed and accuracy.
Details
Motivation: To create efficient neural surrogates for high-resolution 3D physics simulations that can handle complex PDE dynamics with reduced computational requirements.
Method: Hybrid CNN-Transformer backbone architecture pretrained on small simulation patches, fused for global solutions, optionally guided by sequence-to-sequence modeling for long-range dependencies.
Result: Significantly outperforms baseline methods, scales to 512^3 spatial resolution turbulence, and successfully learns dynamics of 14 different 3D PDE types. Also works as diffusion model for probabilistic turbulent flow sampling.
Conclusion: The proposed framework provides an effective and scalable approach for learning deterministic and probabilistic neural surrogates for high-resolution 3D physics simulations across various PDE types and complex turbulent flows.
Abstract: We present a scalable framework for learning deterministic and probabilistic neural surrogates for high-resolution 3D physics simulations. We introduce a hybrid CNN-Transformer backbone architecture targeted for 3D physics simulations, which significantly outperforms existing architectures in terms of speed and accuracy. Our proposed network can be pretrained on small patches of the simulation domain, which can be fused to obtain a global solution, optionally guided via a fast and scalable sequence-to-sequence model to include long-range dependencies. This setup allows for training large-scale models with reduced memory and compute requirements for high-resolution datasets. We evaluate our backbone architecture against a large set of baseline methods with the objective to simultaneously learn the dynamics of 14 different types of PDEs in 3D. We demonstrate how to scale our model to high-resolution isotropic turbulence with spatial resolutions of up to $512^3$. Finally, we demonstrate the versatility of our network by training it as a diffusion model to produce probabilistic samples of highly turbulent 3D channel flows across varying Reynolds numbers, accurately capturing the underlying flow statistics.
[254] Hadamard-Riemannian Optimization for Margin-Variance Ensemble
Zexu Jin
Main category: cs.LG
TL;DR: A novel ensemble learning framework that optimizes both expected margin and margin variance, with reparameterized weights on unit sphere for better efficiency and performance.
Details
Motivation: Conventional margin-based ensemble methods focus only on maximizing expected margin while ignoring margin variance, limiting generalization and causing overfitting, especially in noisy/imbalanced datasets. Traditional weight optimization in probability simplex is computationally inefficient.
Method: Proposes ensemble framework that explicitly incorporates margin variance into loss function, jointly optimizing negative expected margin and its variance. Reparameterizes ensemble weights onto unit sphere to simplify optimization and improve computational efficiency.
Result: Extensive experiments on multiple benchmark datasets show the proposed approach consistently outperforms traditional margin-based ensemble techniques.
Conclusion: The method enhances robustness, improves generalization performance, and offers better computational efficiency compared to conventional ensemble approaches.
Abstract: Ensemble learning has been widely recognized as a pivotal technique for boosting predictive performance by combining multiple base models. Nevertheless, conventional margin-based ensemble methods predominantly focus on maximizing the expected margin while neglecting the critical role of margin variance, which inherently restricts the generalization capability of the model and heightens its vulnerability to overfitting, particularly in noisy or imbalanced datasets. Additionally, the conventional approach of optimizing ensemble weights within the probability simplex often introduces computational inefficiency and scalability challenges, complicating its application to large-scale problems. To tackle these limitations, this paper introduces a novel ensemble learning framework that explicitly incorporates margin variance into the loss function. Our method jointly optimizes the negative expected margin and its variance, leading to enhanced robustness and improved generalization performance. Moreover, by reparameterizing the ensemble weights onto the unit sphere, we substantially simplify the optimization process and improve computational efficiency. Extensive experiments conducted on multiple benchmark datasets demonstrate that the proposed approach consistently outperforms traditional margin-based ensemble techniques, underscoring its effectiveness and practical utility.
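A compact sketch of the joint objective: negative expected margin plus a variance penalty, with ensemble weights reparameterized onto the unit sphere via simple normalization. The variance weight `lam` and the exact normalization are assumptions for illustration, not the paper's Hadamard-Riemannian update.

```python
import torch

def margin_variance_loss(scores, labels, v, lam=0.5):
    """Joint objective: -E[margin] + lam * Var[margin].

    scores: (n_samples, n_models) signed base-model scores; labels in {-1, +1}.
    v: unconstrained weights, mapped onto the unit sphere, sidestepping
    simplex projections.
    """
    w = v / v.norm()                  # unit-sphere reparameterization
    margins = labels * (scores @ w)   # per-sample ensemble margins
    return -margins.mean() + lam * margins.var()

torch.manual_seed(0)
scores = torch.randn(100, 5)
labels = torch.sign(torch.randn(100))
v = torch.randn(5, requires_grad=True)
opt = torch.optim.SGD([v], lr=0.1)
for _ in range(50):
    opt.zero_grad()
    loss = margin_variance_loss(scores, labels, v)
    loss.backward()
    opt.step()
print(loss.item())
```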
[255] A Certifiable Machine Learning-Based Pipeline to Predict Fatigue Life of Aircraft Structures
Ángel Ladrón, Miguel Sánchez-Domínguez, Javier Rozalén, Fernando R. Sánchez, Javier de Vicente, Lucas Lacasa, Eusebio Valero, Gonzalo Rubio
Main category: cs.LG
TL;DR: ML-based pipeline for aircraft wing fatigue life prediction using flight parameters, reducing need for costly simulations while maintaining accuracy.
Details
Motivation: Traditional fatigue life prediction methods are time-consuming, complex, and require multiple teams/tools. ML can complement these methods by providing faster iterations and generalization.
Method: Machine learning pipeline that estimates fatigue life at different aircraft wing locations based on flight parameters from various missions throughout operational life.
Result: Accurate fatigue life predictions with thorough statistical validation and uncertainty quantification in realistic use cases.
Conclusion: The ML pipeline serves as a valuable complement to traditional methodologies by reducing computational and human resource requirements while maintaining prediction accuracy.
Abstract: Fatigue life prediction is essential in both the design and operational phases of any aircraft, and in this sense safety in the aerospace industry requires early detection of fatigue cracks to prevent in-flight failures. Robust and precise fatigue life predictors are thus essential to ensure safety. Traditional engineering methods, while reliable, are time consuming and involve complex workflows, including steps such as conducting several Finite Element Method (FEM) simulations, deriving the expected loading spectrum, and applying cycle counting techniques like peak-valley or rainflow counting. These steps often require collaboration between multiple teams and tools, added to the computational time and effort required to achieve fatigue life predictions. Machine learning (ML) offers a promising complement to traditional fatigue life estimation methods, enabling faster iterations and generalization, providing quick estimates that guide decisions alongside conventional simulations. In this paper, we present a ML-based pipeline that aims to estimate the fatigue life of different aircraft wing locations given the flight parameters of the different missions that the aircraft will be operating throughout its operational life. We validate the pipeline in a realistic use case of fatigue life estimation, yielding accurate predictions alongside a thorough statistical validation and uncertainty quantification. Our pipeline constitutes a complement to traditional methodologies by reducing the amount of costly simulations and, thereby, lowering the required computational and human resources.
[256] Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications
Janis Keuper
Main category: cs.LG
TL;DR: Simple prompt injections can manipulate LLM peer reviews with 100% effectiveness, and LLMs show strong bias toward paper acceptance (>95%) in scientific peer review.
Details
Motivation: To investigate the practicability and technical success of hidden prompt injections in manipulating LLM-based scientific peer review scores, as this would significantly impact the debate around LLM usage in peer review.
Method: Systematic evaluation using 1,000 reviews of 2024 ICLR papers generated by a wide range of LLMs to test the effectiveness of simple prompt injections.
Result: Very simple prompt injections are highly effective (up to 100% acceptance scores) and LLM reviews are generally biased toward acceptance (>95% in many models).
Conclusion: Both findings have significant implications for the ongoing discussion about LLM usage in scientific peer review, demonstrating vulnerabilities to manipulation and inherent acceptance bias.
Abstract: The ongoing intense discussion on rising LLM usage in the scientific peer-review process has recently been mingled by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such “attacks” - although seen by some commentators as “self-defense” - would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation, using 1,000 reviews of 2024 ICLR papers generated by a wide range of LLMs, shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have great impact on the ongoing discussions on LLM usage in peer-review.
[257] Property prediction for ionic liquids without prior structural knowledge using limited experimental data: A data-driven neural recommender system leveraging transfer learning
Sahil Sethi, Kai Sundmacher, Caroline Ganzer
Main category: cs.LG
TL;DR: A transfer learning framework using neural recommender systems to predict ionic liquid properties by combining COSMO-RS simulations with sparse experimental data, achieving improved performance for density, viscosity, surface tension, heat capacity, and melting point predictions.
Details
Motivation: Ionic liquids have versatile applications but predicting their thermophysical properties is challenging due to the vast chemical design space and limited experimental data availability.
Method: Two-stage process: 1) Pre-training neural recommender systems on COSMO-RS simulated data to learn property-specific structural embeddings for cations and anions, 2) Fine-tuning feedforward neural networks using these embeddings with experimental data at varying temperatures and pressures.
Result: The framework achieved substantial performance improvements for four out of five target properties (density, viscosity, surface tension, heat capacity, melting point) and enabled robust extrapolation to unseen ILs. Models can predict properties for over 700,000 IL combinations.
Conclusion: Combining simulated data with transfer learning effectively overcomes experimental data sparsity, providing a scalable solution for ionic liquid screening in process design.
Abstract: Ionic liquids (ILs) have emerged as versatile replacements for traditional solvents because their physicochemical properties can be precisely tailored to various applications. However, accurately predicting key thermophysical properties remains challenging due to the vast chemical design space and the limited availability of experimental data. In this study, we present a data-driven transfer learning framework that leverages a neural recommender system (NRS) to enable reliable property prediction for ILs using sparse experimental datasets. The approach involves a two-stage process: first, pre-training NRS models on COSMO-RS-based simulated data at fixed temperature and pressure to learn property-specific structural embeddings for cations and anions; and second, fine-tuning simple feedforward neural networks using these embeddings with experimental data at varying temperatures and pressures. In this work, five essential IL properties are considered: density, viscosity, surface tension, heat capacity, and melting point. The framework supports both within-property and cross-property knowledge transfer. Notably, pre-trained models for density, viscosity, and heat capacity are used to fine-tune models for all five target properties, achieving improved performance by a substantial margin for four of them. The model exhibits robust extrapolation to previously unseen ILs. Moreover, the final trained models enable property prediction for over 700,000 IL combinations, offering a scalable solution for IL screening in process design. This work highlights the effectiveness of combining simulated data and transfer learning to overcome sparsity in the experimental data.
[258] Proof of AutoML: SDN based Secure Energy Trading with Blockchain in Disaster Case
Salih Toprak, Muge Erel-Ozcevik
Main category: cs.LG
TL;DR: Proposes Proof of AutoML - using ML regressors as nonce generators for blockchain-secured energy trading in disaster scenarios, with SDN enabling adaptive network control.
Details
Motivation: Need for secure, traceable energy trading when conventional infrastructure fails in disasters, requiring robust nonce generation for blockchain integrity.
Method: SDN-enabled architecture with AutoML-selected regression models (Gradient Boosting, LightGBM, Random Forest, Extra Trees, K-NN) evaluated for randomness generation rather than prediction accuracy, using a 9000-sample dataset.
Result: Random Forest and Extra Trees show complete randomness dependency; Gradient Boosting (97.6%), K-NN (98.8%), LightGBM (99.9%) show strong randomness. Tree-based ensembles most effective as nonce generators.
Conclusion: Machine learning models, particularly tree-based ensembles, can serve as effective lightweight nonce generators for blockchain-secured SDN-based energy trading systems resilient to disaster conditions.
Abstract: In disaster scenarios where conventional energy infrastructure is compromised, secure and traceable energy trading between solar-powered households and mobile charging units becomes a necessity. To ensure the integrity of such transactions over a blockchain network, robust and unpredictable nonce generation is vital. This study proposes an SDN-enabled architecture where machine learning regressors are leveraged not for their accuracy, but for their potential to generate randomized values suitable as nonce candidates; the scheme is therefore called Proof of AutoML. Here, SDN allows flexible control over data flows and energy routing policies even in fragmented or degraded networks, ensuring adaptive response during emergencies. Using a 9000-sample dataset, we evaluate five AutoML-selected regression models - Gradient Boosting, LightGBM, Random Forest, Extra Trees, and K-Nearest Neighbors - not by their prediction accuracy, but by their ability to produce diverse and non-deterministic outputs across shuffled data inputs. Randomness analysis reveals that Random Forest and Extra Trees regressors exhibit complete dependency on randomness, whereas Gradient Boosting, K-Nearest Neighbors and LightGBM show strong but slightly lower randomness scores (97.6%, 98.8% and 99.9%, respectively). These findings highlight that certain machine learning models, particularly tree-based ensembles, may serve as effective and lightweight nonce generators within blockchain-secured, SDN-based energy trading infrastructures resilient to disaster conditions.
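A hedged sketch of the nonce-generation idea: fit a non-deterministic tree ensemble on shuffled data and hash its predictions into a fixed-size nonce candidate. The dataset here is a random stand-in for the energy data, and the hashing step is our assumption about how raw regressor outputs would be turned into nonces.

```python
import hashlib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng()
X, y = rng.random((1000, 8)), rng.random(1000)   # stand-in for the energy dataset

# Shuffle the data each round: tree ensembles trained on reshuffled inputs with
# fresh bootstrap randomness produce non-deterministic predictions, which is
# the property the paper exploits.
idx = rng.permutation(len(X))
model = RandomForestRegressor(n_estimators=20).fit(X[idx], y[idx])
pred = model.predict(X[:4])

# Hash the prediction bytes into a fixed-size nonce candidate (our assumption).
nonce = hashlib.sha256(pred.tobytes()).hexdigest()
print(nonce)
```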
[259] Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data
Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang
Main category: cs.LG
TL;DR: CDQAC is an offline RL algorithm that learns job-shop scheduling policies directly from historical data without online interactions, outperforming both data-generating heuristics and state-of-the-art RL methods with high sample efficiency.
Details
Motivation: Existing online RL methods for JSP/FJSP require millions of costly simulated interactions and suffer from poor sample efficiency due to random policy initialization, limiting their practical applicability.
Method: Conservative Discrete Quantile Actor-Critic (CDQAC) couples a quantile-based critic with delayed policy updates, estimating return distributions for machine-operation pairs rather than direct selection.
Result: CDQAC consistently outperforms original data-generating heuristics and state-of-the-art offline/online RL baselines, achieving high performance with only 10-20 training instances. Surprisingly performs better with random heuristic data than higher-quality genetic algorithm data.
Conclusion: CDQAC provides an effective offline RL solution for scheduling problems that eliminates costly online interactions while maintaining ability to improve upon suboptimal training data, demonstrating remarkable sample efficiency and performance.
Abstract: The Job-Shop Scheduling Problem (JSP) and Flexible Job-Shop Scheduling Problem (FJSP) are canonical combinatorial optimization problems with wide-ranging applications in industrial operations. In recent years, many online reinforcement learning (RL) approaches have been proposed to learn constructive heuristics for JSP and FJSP. Although effective, these online RL methods require millions of interactions with simulated environments that may not capture real-world complexities, and their random policy initialization leads to poor sample efficiency. To address these limitations, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline RL algorithm that learns effective scheduling policies directly from historical data, eliminating the need for costly online interactions, while maintaining the ability to improve upon suboptimal training data. CDQAC couples a quantile-based critic with a delayed policy update, estimating the return distribution of each machine-operation pair rather than selecting pairs outright. Our extensive experiments demonstrate CDQAC’s remarkable ability to learn from diverse data sources. CDQAC consistently outperforms the original data-generating heuristics and surpasses state-of-the-art offline and online RL baselines. In addition, CDQAC is highly sample efficient, requiring only 10-20 training instances to learn high-quality policies. Surprisingly, we find that CDQAC performs better when trained on data generated by a random heuristic than when trained on higher-quality data from genetic algorithms and priority dispatching rules.
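The quantile-based critic at the heart of CDQAC rests on standard quantile-regression machinery. The sketch below shows the generic quantile Huber loss for a batch of predicted return quantiles; CDQAC's conservatism and delayed policy update are omitted, so this is background intuition rather than the paper's full algorithm.

```python
import torch

def quantile_huber_loss(pred_quantiles, target, taus, kappa=1.0):
    """Quantile-regression (Huber) loss for a distributional critic.

    pred_quantiles: (batch, n_quantiles) predicted return quantiles for the
    chosen machine-operation pair; target: (batch,) bootstrapped returns.
    """
    td = target.unsqueeze(1) - pred_quantiles            # (batch, n_quantiles)
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td**2,
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric weighting pushes each output toward its quantile level tau.
    return (torch.abs(taus - (td.detach() < 0).float()) * huber).mean()

n_q = 8
taus = (torch.arange(n_q) + 0.5) / n_q
loss = quantile_huber_loss(torch.randn(32, n_q), torch.randn(32), taus)
print(loss.item())
```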
[260] GraphCSVAE: Graph Categorical Structured Variational Autoencoder for Spatiotemporal Auditing of Physical Vulnerability Towards Sustainable Post-Disaster Risk Reduction
Joshua Dimasaka, Christian Geiß, Robert Muir-Wood, Emily So
Main category: cs.LG
TL;DR: A novel GraphCSVAE framework for modeling physical vulnerability using satellite data and expert knowledge, applied to disaster-affected areas in Bangladesh and Sierra Leone.
Details
Motivation: Address the gap in modeling physical vulnerability for disaster risk reduction, as current methods focus mainly on hazard and exposure but lack progress in vulnerability modeling.
Method: Graph Categorical Structured Variational Autoencoder (GraphCSVAE) that integrates deep learning, graph representation, and categorical probabilistic inference with time-series satellite data and expert belief systems.
Result: Revealed post-disaster regional dynamics in physical vulnerability in cyclone-impacted Bangladesh and mudslide-affected Sierra Leone, providing insights for localized auditing.
Conclusion: The framework offers valuable tools for spatiotemporal vulnerability assessment and sustainable post-disaster risk reduction strategies.
Abstract: In the aftermath of disasters, many institutions worldwide face challenges in continually monitoring changes in disaster risk, limiting the ability of key decision-makers to assess progress towards the UN Sendai Framework for Disaster Risk Reduction 2015-2030. While numerous efforts have substantially advanced the large-scale modeling of hazard and exposure through Earth observation and data-driven methods, progress remains limited in modeling another equally important yet challenging element of the risk equation: physical vulnerability. To address this gap, we introduce Graph Categorical Structured Variational Autoencoder (GraphCSVAE), a novel probabilistic data-driven framework for modeling physical vulnerability by integrating deep learning, graph representation, and categorical probabilistic inference, using time-series satellite-derived datasets and prior expert belief systems. We introduce a weakly supervised first-order transition matrix that reflects the changes in the spatiotemporal distribution of physical vulnerability in two disaster-stricken and socioeconomically disadvantaged areas: (1) the cyclone-impacted coastal Khurushkul community in Bangladesh and (2) the mudslide-affected city of Freetown in Sierra Leone. Our work reveals post-disaster regional dynamics in physical vulnerability, offering valuable insights into localized spatiotemporal auditing and sustainable strategies for post-disaster risk reduction.
[261] ARMA Block: A CNN-Based Autoregressive and Moving Average Module for Long-Term Time Series Forecasting
Myung Jin Kim, YeongHyeon Park, Il Dong Yun
Main category: cs.LG
TL;DR: A simple convolutional module called ARMA for long-term time series forecasting, inspired by ARIMA but with direct multi-step prediction capability and better scalability to multivariate data.
Details
Motivation: To create a more efficient and scalable alternative to traditional ARIMA models for time series forecasting, addressing the limitations of iterative multi-step forecasting and enabling better handling of multivariate data.
Method: Proposes a convolutional block with two components: one for capturing trend (autoregression) and another for refining local variations (moving average). The block directly performs multi-step forecasting instead of iterative approaches.
Result: Achieves competitive accuracy on nine benchmark datasets, particularly excelling on datasets with strong trend variations, while maintaining architectural simplicity. The block also inherently encodes absolute positional information.
Conclusion: ARMA block provides an effective and lightweight solution for time series forecasting, potentially serving as a replacement for positional embeddings in sequential models due to its inherent positional encoding capability.
Abstract: This paper proposes a simple yet effective convolutional module for long-term time series forecasting. The proposed block, inspired by the Auto-Regressive Integrated Moving Average (ARIMA) model, consists of two convolutional components: one for capturing the trend (autoregression) and the other for refining local variations (moving average). Unlike conventional ARIMA, which requires iterative multi-step forecasting, the block directly performs multi-step forecasting, making it easily extendable to multivariate settings. Experiments on nine widely used benchmark datasets demonstrate that our method ARMA achieves competitive accuracy, particularly on datasets exhibiting strong trend variations, while maintaining architectural simplicity. Furthermore, analysis shows that the block inherently encodes absolute positional information, suggesting its potential as a lightweight replacement for positional embeddings in sequential models.
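A toy PyTorch reading of the block: one convolution for the trend, a second refining local variations of the residual, and a linear head that emits the whole horizon at once (direct multi-step forecasting). Kernel sizes and the additive composition are illustrative guesses, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ARMABlock(nn.Module):
    """Toy ARMA-style block: trend conv + local-detail conv + direct head."""
    def __init__(self, lookback, horizon):
        super().__init__()
        self.ar = nn.Conv1d(1, 1, kernel_size=25, padding=12)   # smooth trend
        self.ma = nn.Conv1d(1, 1, kernel_size=5, padding=2)     # local variation
        self.head = nn.Linear(lookback, horizon)                # all H steps at once

    def forward(self, x):                 # x: (batch, lookback)
        h = x.unsqueeze(1)
        trend = self.ar(h)
        h = trend + self.ma(h - trend)    # refine what the trend misses
        return self.head(h.squeeze(1))

block = ARMABlock(lookback=96, horizon=24)
print(block(torch.randn(4, 96)).shape)    # (4, 24): direct multi-step output
```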
[262] Physics-informed sensor coverage through structure preserving machine learning
Benjamin David Shaffer, Brooks Kinch, Joseph Klobusicky, M. Ani Hsieh, Nathaniel Trask
Main category: cs.LG
TL;DR: A machine learning framework using structure-preserving digital twins with conditional neural Whitney forms for adaptive source localization in hydrodynamic-transport systems, combining FEEC guarantees with transformer-based operator learning for real-time trajectory planning and data assimilation.
Details
Motivation: To develop an adaptive source localization system that preserves physical structure and conservation laws while enabling real-time adaptation to streaming sensor data, overcoming limitations of physics-agnostic approaches in complex geometries.
Method: Uses conditional neural Whitney forms (CNWF) coupling finite element exterior calculus with transformer-based operator learning. Implements staggered scheme alternating between digital twin evaluation and Lloyd’s algorithm for sensor placement. Employs conditional attention mechanism to identify reduced Whitney-form basis, integral balance equations, and source field compatible with sensor measurements.
Result: The framework preserves discrete conservation, adapts in real-time to sensor data, and demonstrates improved accuracy in complex geometries compared to physics-agnostic transformer architectures. Enables recovery of point sources under continuity assumptions with physically realizable mappings from sensor data to source fields.
Conclusion: Structure preservation through CNWF provides an effective inductive bias for source identification, retaining stability and consistency of finite-element simulation while enabling adaptive localization with monotone improvement of coverage functional through optimal sensor placement.
Abstract: We present a machine learning framework for adaptive source localization in which agents use a structure-preserving digital twin of a coupled hydrodynamic-transport system for real-time trajectory planning and data assimilation. The twin is constructed with conditional neural Whitney forms (CNWF), coupling the numerical guarantees of finite element exterior calculus (FEEC) with transformer-based operator learning. The resulting model preserves discrete conservation, and adapts in real time to streaming sensor data. It employs a conditional attention mechanism to identify: a reduced Whitney-form basis; reduced integral balance equations; and a source field, each compatible with given sensor measurements. The induced reduced-order environmental model retains the stability and consistency of standard finite-element simulation, yielding a physically realizable, regular mapping from sensor data to the source field. We propose a staggered scheme that alternates between evaluating the digital twin and applying Lloyd’s algorithm to guide sensor placement, with analysis providing conditions for monotone improvement of a coverage functional. Using the predicted source field as an importance function within an optimal-recovery scheme, we demonstrate recovery of point sources under continuity assumptions, highlighting the role of regularity as a sufficient condition for localization. Experimental comparisons with physics-agnostic transformer architectures show improved accuracy in complex geometries when physical constraints are enforced, indicating that structure preservation provides an effective inductive bias for source identification.
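The sensor-placement half of the staggered scheme can be sketched with a weighted Lloyd iteration, where each site moves to the importance-weighted centroid of its Voronoi cell. The weights below are a toy stand-in for the digital twin's predicted source field.

```python
import numpy as np

def weighted_lloyd(points, weights, sensors, n_iter=20):
    """Lloyd's algorithm with importance weights on a sampled domain."""
    for _ in range(n_iter):
        # Assign each sample point to its nearest sensor (Voronoi cells).
        d = np.linalg.norm(points[:, None, :] - sensors[None, :, :], axis=-1)
        cell = d.argmin(axis=1)
        for k in range(len(sensors)):
            m = cell == k
            if m.any():
                sensors[k] = np.average(points[m], axis=0, weights=weights[m])
    return sensors

rng = np.random.default_rng(0)
pts = rng.random((2000, 2))                             # sampled unit square
w = np.exp(-20 * np.sum((pts - 0.7) ** 2, axis=1))      # toy "source" importance
print(weighted_lloyd(pts, w, rng.random((5, 2))))       # sensors drift toward the source
```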
[263] A Discrepancy-Based Perspective on Dataset Condensation
Tong Chen, Raghavendra Selvan
Main category: cs.LG
TL;DR: A unified framework for dataset condensation that extends beyond generalization to include robustness, privacy, and other properties using discrepancy measures.
Details
Motivation: To generalize dataset condensation beyond task-specific approaches and provide a formal mathematical foundation using distribution discrepancy measures.
Method: Proposes a unified framework that encompasses existing DC methods and extends them using notions of discrepancy to quantify distance between probability distributions.
Result: The framework broadens DC objectives to include additional desirable properties beyond just generalization performance.
Conclusion: The work provides a more comprehensive and formal definition of dataset condensation that can accommodate multiple objectives including robustness and privacy.
Abstract: Given a dataset of finitely many elements $\mathcal{T} = \{\mathbf{x}_i\}_{i=1}^N$, the goal of dataset condensation (DC) is to construct a synthetic dataset $\mathcal{S} = \{\tilde{\mathbf{x}}_j\}_{j=1}^M$ which is significantly smaller ($M \ll N$) such that a model trained from scratch on $\mathcal{S}$ achieves comparable or even superior generalization performance to a model trained on $\mathcal{T}$. Recent advances in DC reveal a close connection to the problem of approximating the data distribution represented by $\mathcal{T}$ with a reduced set of points. In this work, we present a unified framework that encompasses existing DC methods and extend the task-specific notion of DC to a more general and formal definition using notions of discrepancy, which quantify the distance between probability distributions in different regimes. Our framework broadens the objective of DC beyond generalization, accommodating additional objectives such as robustness, privacy, and other desirable properties.
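As one concrete instance of the discrepancy view, the sketch below condenses a toy dataset by directly minimizing a Gaussian-kernel maximum mean discrepancy (MMD) between the synthetic set S and the original set T; the kernel, bandwidth, and optimizer are illustrative choices, not the paper's prescribed instantiation.

```python
import torch

def mmd2(x, y, sigma=1.0):
    """(Biased) squared MMD with a Gaussian kernel between two samples."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

T = torch.randn(500, 10)                        # original data
S = torch.randn(20, 10, requires_grad=True)     # small synthetic set (M << N)
opt = torch.optim.Adam([S], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = mmd2(S, T)    # shrink the discrepancy between S and T
    loss.backward()
    opt.step()
print(loss.item())
```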
[264] Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms
Gul Rukh Khattak, Konstantinos Patlatzoglou, Joseph Barker, Libor Pastika, Boroumand Zeidaabadi, Ahmed El-Medany, Hesham Aggour, Yixiu Liang, Antonio H. Ribeiro, Jeffrey Annis, Antonio Luiz Pinho Ribeiro, Junbo Ge, Daniel B. Kramer, Jonathan W. Waks, Evan Brittain, Nicholas Peters, Fu Siong Ng, Arunashis Sau
Main category: cs.LG
TL;DR: CAPE foundation model uses contrastive learning on ECG data from diverse populations, showing that pretraining cohort composition affects downstream performance. Multi-center diverse cohorts improve in-distribution accuracy but reduce OOD generalization due to cohort-specific artifacts. Proposed IDB strategy enhances OOD robustness.
Details
Motivation: To understand how cohort demographics, health status, and population diversity influence contrastive learning performance in ECG analysis, and to address the limitations of current approaches in out-of-distribution generalization.
Method: Developed CAPE foundation model using contrastive learning pretrained on 5.2M ECGs from diverse populations across three continents. Systematically assessed cohort composition effects and proposed In-Distribution Batch (IDB) strategy to preserve intra-cohort consistency during pretraining.
Result: Downstream performance depends on pretraining cohort distributional properties. Multi-center diverse cohorts improve in-distribution accuracy but reduce OOD generalization by encoding cohort-specific artifacts. IDB strategy successfully enhances OOD robustness.
Conclusion: Cohort composition significantly impacts contrastive learning performance. The proposed IDB strategy provides important insights for developing clinically fair and generalizable foundation models that work well across diverse populations.
Abstract: Contrastive learning is a widely adopted self-supervised pretraining strategy, yet its dependence on cohort composition remains underexplored. We present Contrasting by Patient Augmented Electrocardiograms (CAPE) foundation model and pretrain on four cohorts (n = 5,203,352), from diverse populations across three continents (North America, South America, Asia). We systematically assess how cohort demographics, health status, and population diversity influence the downstream performance for prediction tasks also including two additional cohorts from another continent (Europe). We find that downstream performance depends on the distributional properties of the pretraining cohort, including demographics and health status. Moreover, while pretraining with a multi-centre, demographically diverse cohort improves in-distribution accuracy, it reduces out-of-distribution (OOD) generalisation of our contrastive approach by encoding cohort-specific artifacts. To address this, we propose the In-Distribution Batch (IDB) strategy, which preserves intra-cohort consistency during pretraining and enhances OOD robustness. This work provides important insights for developing clinically fair and generalisable foundation models.
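One plausible reading of the IDB strategy in code: every pretraining batch is drawn from a single cohort, so contrastive comparisons happen within, not across, cohorts, and the encoder cannot shortcut by encoding cohort-specific artifacts. The exact batching rules in the paper may differ.

```python
import numpy as np

def idb_batches(cohort_ids, batch_size, rng):
    """Yield index batches where each batch comes from exactly one cohort."""
    for c in np.unique(cohort_ids):
        idx = rng.permutation(np.where(cohort_ids == c)[0])
        for i in range(0, len(idx) - batch_size + 1, batch_size):
            yield idx[i:i + batch_size]

rng = np.random.default_rng(0)
cohorts = rng.integers(0, 4, size=1000)   # toy: 1000 ECGs from 4 cohorts
for batch in idb_batches(cohorts, 32, rng):
    pass  # feed `batch` to the contrastive pretraining step
```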
[265] Flow Straight and Fast in Hilbert Space: Functional Rectified Flow
Jianxin Zhang, Clayton Scott
Main category: cs.LG
TL;DR: Functional extension of rectified flow to infinite-dimensional Hilbert spaces with rigorous formulation and superior performance over existing functional generative models.
Details
Motivation: Rectified flow has been successful in finite-dimensional Euclidean spaces but remains unexplored in infinite-dimensional settings, while existing functional flow matching has restrictive measure-theoretic assumptions.
Method: Established rigorous functional formulation of rectified flow using superposition principle for continuity equations in infinite-dimensional Hilbert space, extending to functional flow matching and functional probability flow ODEs as nonlinear generalizations.
Result: The framework removes restrictive measure-theoretic assumptions from existing functional flow matching theory and demonstrates superior experimental performance compared to existing functional generative models.
Conclusion: This work successfully extends rectified flow to infinite-dimensional spaces, providing a more general and less restrictive framework for functional generative modeling with improved performance.
Abstract: Many generative models originally developed in finite-dimensional Euclidean space have functional generalizations in infinite-dimensional settings. However, the extension of rectified flow to infinite-dimensional spaces remains unexplored. In this work, we establish a rigorous functional formulation of rectified flow in an infinite-dimensional Hilbert space. Our approach builds upon the superposition principle for continuity equations in an infinite-dimensional space. We further show that this framework extends naturally to functional flow matching and functional probability flow ODEs, interpreting them as nonlinear generalizations of rectified flow. Notably, our extension to functional flow matching removes the restrictive measure-theoretic assumptions in the existing theory of Kerrigan et al. (2024). Furthermore, we demonstrate experimentally that our method achieves superior performance compared to existing functional generative models.
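For intuition, here is the finite-dimensional rectified flow objective that the paper lifts to Hilbert space: regress a velocity field onto the constant displacement x1 - x0 along straight interpolation paths. The tiny MLP is a stand-in for a functional architecture.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Minimal velocity field v(x, t) for a rectified flow demo."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=1))

def rectified_flow_loss(model, x0, x1):
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1            # straight-line interpolant
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()   # target: constant velocity

model = VelocityNet(dim=2)
loss = rectified_flow_loss(model, torch.randn(64, 2), torch.randn(64, 2) + 3)
print(loss.item())
```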
[266] Vendi Information Gain for Active Learning and its Application to Ecology
Quan Nguyen, Adji Bousso Dieng
Main category: cs.LG
TL;DR: Vendi information gain (VIG) is a new active learning policy that selects images based on dataset-wide prediction uncertainty, achieving near-full-supervision accuracy with <10% labels on biodiversity monitoring data.
Details
Motivation: Camera trap biodiversity monitoring faces labeling bottlenecks; existing active learning methods focus on individual prediction uncertainty without considering dataset-wide uncertainty.
Method: Vendi information gain (VIG) policy selects images based on impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity in feature space.
Result: On Snapshot Serengeti dataset, VIG achieves predictive accuracy close to full supervision using less than 10% labels, outperforming standard baselines across metrics and collecting more diverse data.
Conclusion: VIG has broad applicability beyond ecology and demonstrates significant value for biodiversity monitoring in data-limited environments through efficient label usage.
Abstract: While monitoring biodiversity through camera traps has become an important endeavor for ecological research, identifying species in the captured image data remains a major bottleneck due to limited labeling resources. Active learning – a machine learning paradigm that selects the most informative data to label and train a predictive model – offers a promising solution, but typically focuses on uncertainty in the individual predictions without considering uncertainty across the entire dataset. We introduce a new active learning policy, Vendi information gain (VIG), that selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. Applied to the Snapshot Serengeti dataset, VIG achieves impressive predictive accuracy close to full supervision using less than 10% of the labels. It consistently outperforms standard baselines across metrics and batch sizes, collecting more diverse data in the feature space. VIG has broad applicability beyond ecology, and our results highlight its value for biodiversity monitoring in data-limited environments.
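The policy builds on the Vendi score. The helper below computes that underlying diversity score (the exponential of the von Neumann entropy of a normalized similarity matrix); the full VIG acquisition step, which scores candidate labels by their effect on dataset-wide uncertainty, is omitted.

```python
import numpy as np

def vendi_score(K):
    """Vendi score of a PSD similarity matrix K with unit diagonal.

    Equals exp(-sum(lam * log lam)) over eigenvalues lam of K/n: an
    effective count of distinct items, between 1 and the rank of K.
    """
    lam = np.linalg.eigvalsh(K / K.shape[0])
    lam = lam[lam > 1e-12]
    return float(np.exp(-(lam * np.log(lam)).sum()))

X = np.random.randn(50, 16)
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm features
print(vendi_score(X @ X.T))   # larger = more diverse (up to the matrix rank)
```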
[267] Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen
Main category: cs.LG
TL;DR: IGPO is a new RL framework that uses dLLMs’ inpainting capability to guide exploration by strategically inserting partial ground-truth reasoning traces during sampling, improving sample efficiency and achieving SOTA results on math benchmarks.
Details
Motivation: Autoregressive LLMs face exploration challenges in RL with sparse rewards and sample waste. Masked dLLMs offer unique inpainting capabilities that can guide exploration more efficiently.
Method: The IGPO framework inserts partial ground-truth reasoning traces during online sampling, bridging supervised fine-tuning and RL. It includes synthetic trace rewriting and entropy-based filtering, and is applied to group-based optimization methods like GRPO.
Result: Achieved substantial gains across GSM8K, Math500, and AMC benchmarks, setting new state-of-the-art results for full-attention masked dLLMs while restoring meaningful gradients and improving sample efficiency.
Conclusion: Inpainting-guided exploration effectively addresses RL challenges for dLLMs, demonstrating that strategic use of partial ground-truth can significantly improve training efficiency and performance on mathematical reasoning tasks.
Abstract: Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity–their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks–GSM8K, Math500, and AMC–achieving new state-of-the-art results for full-attention masked dLLMs.
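A schematic of the core inpainting trick as we read it: fix a fragment of the ground-truth reasoning trace inside the masked canvas and let the dLLM denoise the rest. The `denoise` method is a hypothetical API standing in for a real masked-diffusion sampler.

```python
import random

MASK = "<mask>"

def build_inpainting_canvas(prompt, gt_trace, gen_len, hint_frac=0.25):
    """Canvas = prompt + partially revealed ground-truth trace + masks."""
    canvas = list(prompt) + [MASK] * gen_len
    n_hint = int(hint_frac * min(gen_len, len(gt_trace)))
    start = random.randrange(max(1, len(gt_trace) - n_hint))
    for j in range(n_hint):                 # reveal a contiguous fragment
        pos = len(prompt) + start + j
        if pos < len(canvas):
            canvas[pos] = gt_trace[start + j]
    return canvas

def rollout(dllm, prompt, gt_trace, gen_len, hint_prob=0.5):
    """Mix plain rollouts with inpainting-guided ones."""
    if random.random() < hint_prob:
        canvas = build_inpainting_canvas(prompt, gt_trace, gen_len)
    else:
        canvas = list(prompt) + [MASK] * gen_len
    return dllm.denoise(canvas)             # hypothetical dLLM call

class EchoDLLM:                             # dummy stand-in so the sketch runs
    def denoise(self, canvas):
        return [t if t != MASK else "." for t in canvas]

print(rollout(EchoDLLM(), ["Q:", "2+2?"], ["think", "step", "4"], gen_len=6))
```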
[268] Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining
Rupert Mitchell, Kristian Kersting
Main category: cs.LG
TL;DR: MuSe is an efficient attention approximation that combines semantic clustering with multipole expansions to reduce transformer’s quadratic complexity, achieving 3x speedup and minimal performance loss.
Details
Motivation: Address the quadratic computational complexity of transformers in context length by developing an efficient attention approximation that maintains accuracy while reducing computation.
Method: Clusters queries and keys separately in learned representation spaces using hierarchical two-stage attention with centroid-based approximations and dipole corrections for directional variance. Operates as drop-in replacement for standard attention.
Result: Achieves O(NCD) complexity for acausal attention, 3x speedup over Flash Attention at 8k context length with <20% relative squared errors, and 12.2% runtime reduction with only 0.36% loss degradation in end-to-end pretraining.
Conclusion: Multipole Semantic Attention provides an efficient and viable approximation for transformer pretraining, significantly reducing computational complexity while maintaining performance through separate clustering and multipole expansions.
Abstract: We present Multipole Semantic Attention (MuSe), an efficient approximation of softmax attention that combines semantic clustering with multipole expansions from computational physics. Our method addresses the quadratic computational complexity of transformers in the context length by clustering queries and keys separately in their learned representation spaces, enabling a hierarchical two-stage attention mechanism. Unlike prior clustering approaches that group only keys or use unified clustering, we maintain separate clusterings that respect attention’s asymmetric treatment of these spaces. We augment centroid-based (monopole) approximations with dipole corrections that capture directional variance within clusters, preserving richer information during training. The method operates as a drop-in replacement for standard attention, requiring only hyperparameter specification without architectural modifications. Our approach achieves $\mathcal{O}(NCD)$ complexity for acausal attention with $C$ clusters and $\mathcal{O}(NCD \log N)$ for causal attention. On isolated attention layers, we demonstrate $3\times$ speedup over CUDNN Flash Attention at 8k context length, with relative squared errors below 20%. For causal attention, we develop a hierarchical block decomposition that combines exact local computation with efficient long-range approximation. In end-to-end pretraining of a 30M parameter model on book-length texts with 16k context, we achieve 12.2% runtime reduction with only 0.36% loss degradation, establishing the viability of multipole approximations for efficient transformer pretraining.
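The monopole-plus-dipole expansion is compact enough to sketch in NumPy. Below is our simplified single-head, acausal version: exp(s q·k) is expanded to first order around each key-cluster centroid, giving a count-weighted monopole term plus a per-cluster dipole tensor; the paper's separate query clustering and causal block decomposition are omitted.

```python
import numpy as np

def kmeans(X, C, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), C, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None] - mu[None]) ** 2).sum(-1).argmin(1)
        for c in range(C):
            if (assign == c).any():
                mu[c] = X[assign == c].mean(0)
    return mu, assign

def multipole_attention(Q, K, V, C=8):
    s = 1.0 / np.sqrt(K.shape[1])                # softmax temperature
    mu, assign = kmeans(K, C)
    stats = []
    for c in range(C):
        m = assign == c
        Kc, Vc = K[m], V[m]
        # (count, summed values, dipole tensor sum_k (k - mu_c) v_k^T)
        stats.append((m.sum(), Vc.sum(0), (Kc - mu[c]).T @ Vc))
    out = np.zeros((len(Q), V.shape[1]))
    for i, q in enumerate(Q):
        logits = s * (mu @ q)
        w = np.exp(logits - logits.max())        # stabilised cluster weights
        num, den = np.zeros(V.shape[1]), 0.0
        for c, (n_c, v_sum, D_c) in enumerate(stats):
            num += w[c] * (v_sum + s * (q @ D_c))  # monopole + dipole terms
            den += w[c] * n_c
        out[i] = num / den
    return out

# Compare against exact softmax attention on random data
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(4, 32)), rng.normal(size=(256, 32)), rng.normal(size=(256, 16))
logits = Q @ K.T / np.sqrt(32)
P = np.exp(logits - logits.max(1, keepdims=True))
exact = (P / P.sum(1, keepdims=True)) @ V
print(np.abs(multipole_attention(Q, K, V) - exact).mean())
```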
[269] Run-Time Monitoring of ERTMS/ETCS Control Flow by Process Mining
Francesco Vitale, Tommaso Zoppi, Francesco Flammini, Nicola Mazzocca
Main category: cs.LG
TL;DR: Process mining and machine learning approach for run-time control-flow anomaly detection in railway systems to enhance resilience against unknown faults and cyber-threats.
Details
Motivation: Railway systems face increasing complexity and criticality, with potential anomalies from residual faults, system modifications, and cyber-threats that weren't known at design time, requiring enhanced run-time monitoring.
Method: Uses process mining to learn actual control flow from execution traces for online conformance checking, combined with unsupervised machine learning for anomaly localization to link deviations to critical components.
Result: Tested on ERTMS/ETCS L2 RBC/RBC Handover scenario, showing high accuracy, efficiency, and explainability in detecting and localizing anomalies.
Conclusion: The approach effectively enhances railway system resilience by providing run-time monitoring capabilities that can detect and explain anomalies in critical control systems.
Abstract: Ensuring the resilience of computer-based railways is increasingly crucial to account for uncertainties and changes due to the growing complexity and criticality of those systems. Although their software relies on strict verification and validation processes following well-established best-practices and certification standards, anomalies can still occur at run-time due to residual faults, system and environmental modifications that were unknown at design-time, or other emergent cyber-threat scenarios. This paper explores run-time control-flow anomaly detection using process mining to enhance the resilience of ERTMS/ETCS L2 (European Rail Traffic Management System / European Train Control System Level 2). Process mining allows learning the actual control flow of the system from its execution traces, thus enabling run-time monitoring through online conformance checking. In addition, anomaly localization is performed through unsupervised machine learning to link relevant deviations to critical system components. We test our approach on a reference ERTMS/ETCS L2 scenario, namely the RBC/RBC Handover, to show its capability to detect and localize anomalies with high accuracy, efficiency, and explainability.
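As a toy illustration of online conformance checking, the sketch below scores a trace against a control-flow model reduced to a set of allowed activity transitions. The activity names are hypothetical, and real process-mining toolkits (e.g., pm4py) mine and replay far richer Petri-net models from execution traces.

```python
# Control-flow model as a set of allowed activity transitions (hypothetical names).
ALLOWED = {("start", "handover_request"), ("handover_request", "ack"),
           ("ack", "handover"), ("handover", "end")}

def conformance(trace):
    """Fraction of consecutive activity pairs permitted by the model."""
    pairs = list(zip(trace, trace[1:]))
    return sum(p in ALLOWED for p in pairs) / len(pairs) if pairs else 1.0

print(conformance(["start", "handover_request", "ack", "handover", "end"]))  # 1.0
print(conformance(["start", "ack", "handover", "end"]))   # ~0.67, flagged as anomalous
```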
[270] Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration
Ahmed Khaled, Satyen Kale, Arthur Douillard, Chi Jin, Rob Fergus, Manzil Zaheer
Main category: cs.LG
TL;DR: This paper analyzes the role of outer optimizer hyperparameters in Local SGD, showing that tuning the outer learning rate (sometimes >1) can trade off optimization error vs gradient noise and compensate for poor inner learning rate tuning. The study extends to momentum and acceleration in outer optimizers, providing new convergence guarantees and data-dependent analysis.
Details
Motivation: Communication is a major bottleneck in distributed machine learning with large batch sizes. While Local SGD reduces communication overhead, the choice and tuning of outer optimizer hyperparameters are unclear compared to well-studied local optimization parameters.
Method: Theoretical analysis of Local SGD convergence with different outer optimizers, including proofs for convergence guarantees. Study of outer learning rate tuning, momentum, and acceleration. Comprehensive experiments with standard language models to validate theoretical findings.
Result: Tuning outer learning rate allows trade-off between optimization error and stochastic gradient noise variance, and can compensate for poor inner learning rate tuning. Outer learning rate should sometimes be >1. Momentum and acceleration in outer optimizer improve convergence rates compared to prior local acceleration methods.
Conclusion: The outer optimizer plays a crucial role in Local SGD performance. Proper tuning of outer learning rate (including values >1), momentum, and acceleration significantly improves convergence and communication efficiency in distributed training settings.
Abstract: Modern machine learning often requires training with large batch size, distributed data, and massively parallel compute hardware (like mobile and other edge devices or distributed data centers). Communication becomes a major bottleneck in such settings but methods like Local Stochastic Gradient Descent (Local SGD) show great promise in reducing this additional communication overhead. Local SGD consists of three parts: a local optimization process, an aggregation mechanism, and an outer optimizer that uses the aggregated updates from the nodes to produce a new model. While there exists an extensive literature on understanding the impact of hyperparameters in the local optimization process, the choice of outer optimizer and its hyperparameters is less clear. We study the role of the outer optimizer in Local SGD, and prove new convergence guarantees for the algorithm. In particular, we show that tuning the outer learning rate allows us to (a) trade off between optimization error and stochastic gradient noise variance, and (b) make up for ill-tuning of the inner learning rate. Our theory suggests that the outer learning rate should sometimes be set to values greater than $1$. We extend our results to settings where we use momentum in the outer optimizer, and we show a similar role for the momentum-adjusted outer learning rate. We also study acceleration in the outer optimizer and show that it improves the convergence rate as a function of the number of communication rounds, improving upon the convergence rate of prior algorithms that apply acceleration locally. Finally, we also introduce a novel data-dependent analysis of Local SGD that yields further insights on outer learning rate tuning. We conduct comprehensive experiments with standard language models and various outer optimizers to validate our theory.
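A minimal PyTorch sketch of the algorithmic skeleton, with hyperparameter values of our own choosing: each round, workers take local SGD steps from the global weights, the averaged weight change serves as a pseudo-gradient, and the outer optimizer applies momentum and a learning rate that, per the paper's theory, may usefully exceed 1.

```python
import copy
import itertools
import torch

def local_sgd_round(global_model, worker_loaders, loss_fn,
                    inner_lr=0.01, local_steps=10):
    """One round: local SGD per worker; averaged weight change = pseudo-gradient."""
    deltas = []
    for loader in worker_loaders:
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        data = itertools.cycle(loader)
        for _ in range(local_steps):
            x, y = next(data)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        deltas.append([g.detach() - l.detach() for g, l in
                       zip(global_model.parameters(), model.parameters())])
    return [torch.stack(d).mean(0) for d in zip(*deltas)]

def outer_step(global_model, pseudo_grad, momentum_buf,
               outer_lr=1.5, momentum=0.9):
    """Outer SGD with momentum; note outer_lr above 1 can be beneficial."""
    with torch.no_grad():
        for p, g, m in zip(global_model.parameters(), pseudo_grad, momentum_buf):
            m.mul_(momentum).add_(g)
            p.sub_(outer_lr * m)

# Example wiring with synthetic data
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loaders = [[(torch.randn(8, 4), torch.randn(8, 1))] for _ in range(4)]
buf = [torch.zeros_like(p) for p in model.parameters()]
for _ in range(5):
    outer_step(model, local_sgd_round(model, loaders, torch.nn.functional.mse_loss), buf)
```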
[271] Sufficient Invariant Learning for Distribution Shift
Taero Kim, Subeen Park, Sungjun Lim, Yonghan Jung, Krikamol Muandet, Kyungwoo Song
Main category: cs.LG
TL;DR: Proposes Sufficient Invariant Learning (SIL) framework and ASGDRO algorithm to learn diverse invariant features through common flat minima across environments, addressing limitations of existing methods when invariant features are partially observed.
Details
Motivation: Existing invariant learning methods assume fully observed invariant features in both training and test sets, which is often violated in practice, leading to deteriorated robustness when models rely on invariant features absent in test environments.
Method: Introduces the SIL framework to learn a sufficient subset of invariant features rather than single features, and proposes the ASGDRO algorithm, which seeks common flat minima across environments to learn diverse invariant features.
Result: Theoretical demonstration that finding common flat minima enables robust predictions with diverse invariant features. Empirical evaluations on multiple datasets confirm ASGDRO’s robustness against distribution shifts.
Conclusion: ASGDRO effectively addresses limitations of existing invariant learning methods by learning diverse invariant features through common flat minima, providing improved robustness against distribution shifts when invariant features are partially observed.
Abstract: Learning robust models under distribution shifts between training and test datasets is a fundamental challenge in machine learning. While learning invariant features across environments is a popular approach, it often assumes that these features are fully observed in both training and test sets, a condition frequently violated in practice. When models rely on invariant features absent in the test set, their robustness in new environments can deteriorate. To tackle this problem, we introduce a novel learning principle called the Sufficient Invariant Learning (SIL) framework, which focuses on learning a sufficient subset of invariant features rather than relying on a single feature. After demonstrating the limitation of existing invariant learning methods, we propose a new algorithm, Adaptive Sharpness-aware Group Distributionally Robust Optimization (ASGDRO), to learn diverse invariant features by seeking common flat minima across the environments. We theoretically demonstrate that finding a common flat minima enables robust predictions based on diverse invariant features. Empirical evaluations on multiple datasets, including our new benchmark, confirm ASGDRO’s robustness against distribution shifts, highlighting the limitations of existing methods.
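Our loose reading of the update, combining Group DRO's exponentiated-gradient weights over environments with a SAM-style sharpness step; the actual ASGDRO procedure likely differs in its details.

```python
import torch

def asgdro_step(model, env_batches, loss_fn, opt, rho=0.05, eta=0.1, q=None):
    losses = torch.stack([loss_fn(model(x), y) for x, y in env_batches])
    q = torch.ones(len(env_batches)) / len(env_batches) if q is None else q
    q = q * torch.exp(eta * losses.detach())
    q = q / q.sum()                                 # DRO weights over environments
    robust_loss = (q * losses).sum()

    # SAM-style ascent: perturb weights toward higher robust loss.
    grads = torch.autograd.grad(robust_loss, list(model.parameters()))
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)
    # Descend using gradients at the perturbed point, then undo the perturbation.
    opt.zero_grad()
    perturbed = torch.stack([loss_fn(model(x), y) for x, y in env_batches])
    (q * perturbed).sum().backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)
    opt.step()
    return q

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
envs = [(torch.randn(16, 3), torch.randn(16, 1)) for _ in range(3)]
q = None
for _ in range(10):
    q = asgdro_step(model, envs, torch.nn.functional.mse_loss, opt, q=q)
```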
[272] Analyzing the Impact of Adversarial Examples on Explainable Machine Learning
Prathyusha Devabhakthini, Sasmita Parida, Raj Mani Shukla, Suvendu Chandan Nayak, Tapadhir Das
Main category: cs.LG
TL;DR: Analysis of how adversarial attacks affect model interpretability in text classification, showing that attacks degrade both performance and explainability.
Details
Motivation: Adversarial attacks can seriously compromise ML models in critical applications, and understanding their impact on model interpretability is crucial for trustworthy AI systems.
Method: Developed an ML-based text classification model, introduced adversarial perturbations to the text data, then analyzed classification performance and model explainability before and after the attacks.
Result: Adversarial attacks significantly degraded both the classification performance and the interpretability/explainability of the text classification model.
Conclusion: Adversarial attacks not only compromise model accuracy but also undermine model interpretability, highlighting the need for robust defenses that preserve both performance and explainability in text classification systems.
Abstract: Adversarial attacks are a type of attack on machine learning models where an attacker deliberately modifies the inputs to cause the model to make incorrect predictions. Adversarial attacks can have serious consequences, particularly in applications such as autonomous vehicles, medical diagnosis, and security systems. Work on the vulnerability of deep learning models to adversarial attacks has shown that it is very easy to craft samples that force a model into unintended predictions. In this work, we analyze the impact of adversarial attacks on model interpretability in text classification problems. We develop an ML-based classification model for text data. Then, we introduce adversarial perturbations on the text data to understand the classification performance after the attack. Subsequently, we analyze and interpret the model’s explainability before and after the attack.
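A toy reproduction of the setup as we understand it: train a simple text classifier, apply a crude character-swap perturbation, and compare per-token attributions before and after. The paper's attack and explainer may well differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great movie loved it", "terrible plot awful acting",
               "wonderful film superb cast", "boring awful waste of time"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), labels)

def attributions(text):
    """Per-token contribution to the positive-class logit."""
    x = vec.transform([text]).toarray()[0]
    vocab = vec.get_feature_names_out()
    return {vocab[i]: x[i] * clf.coef_[0][i] for i in np.nonzero(x)[0]}

def swap_attack(text):
    """Swap two inner characters of the longest word (crude perturbation)."""
    words = text.split()
    i = max(range(len(words)), key=lambda j: len(words[j]))
    w = words[i]
    words[i] = w[0] + w[2] + w[1] + w[3:] if len(w) > 3 else w
    return " ".join(words)

clean = "wonderful acting but boring plot"
print(attributions(clean))
print(attributions(swap_attack(clean)))  # perturbed token vanishes from the explanation
```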
[273] Interpretable Data-driven Anomaly Detection in Industrial Processes with ExIFFI
Davide Frizzo, Francesco Borsatti, Alessio Arcudi, Antonio De Moliner, Roberto Oboe, Gian Antonio Susto
Main category: cs.LG
TL;DR: ExIFFI provides fast, efficient explanations for anomaly detection in industrial settings, outperforming other explainable AD models in both effectiveness and computational efficiency.
Details
Motivation: Conventional anomaly detection methods only label observations as normal/anomalous without providing insights. Industry 5.0 requires interpretable outcomes to help users understand the reasoning behind model decisions.
Method: Applies ExIFFI (an explanation approach for Extended Isolation Forest) to three industrial datasets, comparing it against other state-of-the-art explainable anomaly detection models.
Result: ExIFFI demonstrates superior explanation effectiveness and computational efficiency compared to other explainable AD models.
Conclusion: ExIFFI represents the first successful industrial application of explainable anomaly detection that provides both fast performance and meaningful insights into model decisions.
Abstract: Anomaly Detection (AD) is crucial in industrial settings to streamline operations by detecting underlying issues. Conventional methods merely label observations as normal or anomalous, lacking crucial insights. In Industry 5.0, interpretable outcomes become desirable to enable users to understand the rationale behind model decisions. This paper presents the first industrial application of ExIFFI, a recent approach for fast, efficient explanations for the Extended Isolation Forest (EIF) anomaly detection method. ExIFFI is tested on three industrial datasets, demonstrating superior explanation effectiveness and computational efficiency compared to other state-of-the-art explainable AD models.
[274] A Survey on Group Fairness in Federated Learning: Challenges, Taxonomy of Solutions and Directions for Future Research
Teresa Salazar, Helder Araújo, Alberto Cano, Pedro Henriques Abreu
Main category: cs.LG
TL;DR: This paper provides the first comprehensive survey on group fairness in Federated Learning, analyzing 48 research works, proposing identification practices, creating a taxonomy, and discussing ethical implications.
Details
Motivation: Federated Learning's decentralized nature with heterogeneous data distributions can exacerbate biases, creating an urgent need for fairness methodologies, yet no comprehensive survey exists on group fairness in this context.
Method: The authors analyze 48 research works, propose practices for identification and benchmarking, create a novel taxonomy based on data partitioning, location, and strategy, and examine datasets, applications, and broader concerns.
Result: A comprehensive analysis of group fairness challenges in FL, including a novel taxonomy, identification practices, and examination of how different approaches handle various sensitive attributes and complexities.
Conclusion: Key areas for future research are highlighted, emphasizing the need for more methods to address the complexities of achieving group fairness in federated systems, along with ethical, legal, and policy implications.
Abstract: Group fairness in machine learning is an important area of research focused on achieving equitable outcomes across different groups defined by sensitive attributes such as race or gender. Federated Learning, a decentralized approach to training machine learning models across multiple clients, amplifies the need for fairness methodologies due to its inherent heterogeneous data distributions that can exacerbate biases. The intersection of Federated Learning and group fairness has attracted significant interest, with 48 research works specifically dedicated to addressing this issue. However, no comprehensive survey has specifically focused on group fairness in Federated Learning. In this work, we analyze the key challenges of this topic, propose practices for its identification and benchmarking, and create a novel taxonomy based on criteria such as data partitioning, location, and strategy. Furthermore, we analyze broader concerns, review how different approaches handle the complexities of various sensitive attributes, examine common datasets and applications, and discuss the ethical, legal, and policy implications of group fairness in FL. We conclude by highlighting key areas for future research, emphasizing the need for more methods to address the complexities of achieving group fairness in federated systems.
[275] Is Adversarial Training with Compressed Datasets Effective?
Tong Chen, Raghavendra Selvan
Main category: cs.LG
TL;DR: This paper investigates adversarial robustness in dataset compression methods, showing current DC methods fail to transfer robustness and proposing a new robustness-aware compression method called Minimal Finite Covering (MFC) that provides provable robustness.
Details
Motivation: Current dataset condensation methods focus on achieving high test performance with limited data but neglect adversarial robustness, creating a gap in compressed datasets' ability to transfer robustness properties to trained models.
Method: The authors propose a robustness-aware dataset compression method based on finding the Minimal Finite Covering (MFC) of the dataset, which minimizes generalized adversarial loss and provides provable robustness guarantees.
Result: The compressed datasets from standard DC methods are ineffective for transferring adversarial robustness, while the proposed MFC method is more effective when applying adversarial training and provides provable robustness.
Conclusion: The MFC approach offers a one-time computation solution that is applicable to any model, simultaneously improving dataset compression efficiency and adversarial robustness, addressing a critical limitation in current dataset condensation methods.
Abstract: Dataset Condensation (DC) refers to the recent class of dataset compression methods that generate a smaller, synthetic dataset from a larger dataset. This synthetic dataset aims to retain the essential information of the original dataset, enabling models trained on it to achieve performance levels comparable to those trained on the full dataset. Most current DC methods have mainly been concerned with achieving high test performance under a limited data budget, and have not directly addressed the question of adversarial robustness. In this work, we investigate the impact of adversarial robustness on models trained with compressed datasets. We show that the compressed datasets obtained from DC methods are not effective in transferring adversarial robustness to models. As a solution to improve dataset compression efficiency and adversarial robustness simultaneously, we present a robustness-aware dataset compression method based on finding the Minimal Finite Covering (MFC) of the dataset. The proposed method is (1) provably robust by minimizing the generalized adversarial loss, (2) more effective than DC methods when applying adversarial training over MFC, (3) obtained by a one-time computation and applicable to any model.
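For intuition, an epsilon-covering of a dataset can be approximated greedily: repeatedly pick the point whose epsilon-ball covers the most still-uncovered points. This greedy set cover is only a logarithmic-factor approximation of the Minimal Finite Covering the paper computes.

```python
import numpy as np

def greedy_cover(X, eps):
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)   # pairwise squared distances
    covers = d2 <= eps ** 2                       # covers[i, j]: i covers j
    uncovered = np.ones(len(X), dtype=bool)
    centers = []
    while uncovered.any():
        gain = (covers & uncovered).sum(1)        # newly covered points per center
        i = int(np.argmax(gain))
        centers.append(i)
        uncovered &= ~covers[i]
    return centers

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
centers = greedy_cover(X, eps=0.5)
print(f"{len(centers)} centers cover {len(X)} points at radius 0.5")
```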
[276] Unveiling Group-Specific Distributed Concept Drift: A Fairness Imperative in Federated Learning
Teresa Salazar, João Gama, Helder Araújo, Pedro Henriques Abreu
Main category: cs.LG
TL;DR: This paper introduces the problem of group-specific concept drift in federated learning, where different demographic groups experience concept drift at different rates, threatening fairness. The authors formalize this problem and adapt an existing distributed concept drift algorithm to handle group-specific drift using multi-model approach and local drift detection.
Details
Motivation: Achieving group fairness in machine learning becomes challenging when different demographic groups experience concept drift at different rates. In federated learning environments, this problem is amplified as each client may experience group-specific concept drift independently while sharing the same underlying concept.
Method: The authors adapt an existing distributed concept drift adaptation algorithm to handle group-specific distributed concept drift. The approach uses a multi-model strategy, local group-specific drift detection mechanism, and continuous clustering of models over time.
Result: Experimental findings demonstrate the critical importance of addressing group-specific concept drift and its distributed counterpart to advance fairness in machine learning systems.
Conclusion: The research formally introduces and addresses the previously unexplored problem of group-specific concept drift in distributed learning environments, providing a framework and algorithmic approach to maintain fairness when different demographic groups experience concept drift at varying rates.
Abstract: In the evolving field of machine learning, ensuring group fairness has become a critical concern, prompting the development of algorithms designed to mitigate bias in decision-making processes. Group fairness refers to the principle that a model’s decisions should be equitable across different groups defined by sensitive attributes such as gender or race, ensuring that individuals from privileged groups and unprivileged groups are treated fairly and receive similar outcomes. However, achieving fairness in the presence of group-specific concept drift remains an unexplored frontier, and our research represents pioneering efforts in this regard. Group-specific concept drift refers to situations where one group experiences concept drift over time while another does not, leading to a decrease in fairness even if accuracy remains fairly stable. Within the framework of Federated Learning, where clients collaboratively train models, its distributed nature further amplifies these challenges since each client can experience group-specific concept drift independently while still sharing the same underlying concept, creating a complex and dynamic environment for maintaining fairness. The most significant contribution of our research is the formalization and introduction of the problem of group-specific concept drift and its distributed counterpart, shedding light on its critical importance in the field of fairness. Additionally, leveraging insights from prior research, we adapt an existing distributed concept drift adaptation algorithm to tackle group-specific distributed concept drift which uses a multi-model approach, a local group-specific drift detection mechanism, and continuous clustering of models over time. The findings from our experiments highlight the importance of addressing group-specific concept drift and its distributed counterpart to advance fairness in machine learning.
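A toy sketch of what a local group-specific drift detector might look like: track a sliding-window error rate per sensitive group and raise an alarm when one group's error drifts from its frozen baseline. The detector is our own simplification; the paper adapts an existing distributed algorithm with multi-model clustering on top.

```python
import random
from collections import defaultdict, deque

class GroupDriftDetector:
    def __init__(self, window=50, threshold=0.1):
        self.errors = defaultdict(lambda: deque(maxlen=window))
        self.baseline = {}
        self.threshold = threshold

    def update(self, group, correct):
        buf = self.errors[group]
        buf.append(0 if correct else 1)
        rate = sum(buf) / len(buf)
        if group not in self.baseline and len(buf) == buf.maxlen:
            self.baseline[group] = rate           # freeze a reference error rate
        base = self.baseline.get(group)
        return base is not None and rate - base > self.threshold

det, rng = GroupDriftDetector(), random.Random(0)
for t in range(4000):
    g = t % 2
    p_correct = 0.9 if (g == 0 or t < 2000) else 0.6   # group 1 drifts later
    if det.update(g, rng.random() < p_correct):
        print(f"drift alarm for group {g} at step {t}")
        break
```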
[277] A Novel Approach to Balance Convenience and Nutrition in Meals With Long-Term Group Recommendations and Reasoning on Multimodal Recipes and its Implementation in BEACON
Vansh Nagpal, Siva Likitha Valluru, Kausik Lakkaraju, Nitin Gupta, Zach Abdulrahman, Andrew Davison, Biplav Srivastava
Main category: cs.LG
TL;DR: A data-driven meal recommendation system that balances nutrition and convenience using customizable meal configurations and contextual bandit learning methods.
Details
Motivation: People face tradeoffs between nutritious choices and convenience when selecting meals, needing a solution that considers both user preferences and food constituents.
Method: Introduces goodness measures, recipe conversion to multimodal R3 format, and learning methods using contextual bandits within the BEACON prototype system.
Result: The approach shows promising preliminary results for meal recommendations that balance nutrition content with factors like cost, accessibility, and cuisine type.
Conclusion: The BEACON system provides a data-driven solution for personalized meal recommendations that addresses the complex tradeoffs in food selection decisions.
Abstract: A common decision made by people, whether healthy or with health conditions, is choosing meals like breakfast, lunch, and dinner, comprising combinations of foods for appetizer, main course, side dishes, desserts, and beverages. Often, this decision involves tradeoffs between nutritious choices (e.g., salt and sugar levels, nutrition content) and convenience (e.g., cost and accessibility, cuisine type, food source type). We present a data-driven solution for meal recommendations that considers customizable meal configurations and time horizons. This solution balances user preferences while accounting for food constituents and cooking processes. Our contributions include introducing goodness measures, a recipe conversion method from text to the recently introduced multimodal rich recipe representation (R3) format, learning methods using contextual bandits that show promising preliminary results, and the prototype, usage-inspired, BEACON system.
[278] Uncertainty Modeling in Graph Neural Networks via Stochastic Differential Equations
Richard Bergna, Sergio Calvo-Ordoñez, Felix L. Opolka, Pietro Liò, Jose Miguel Hernandez-Lobato
Main category: cs.LG
TL;DR: LGNSDE framework enhances graph neural ODEs with stochastic differential equations to quantify both epistemic and aleatoric uncertainty in graph data, providing theoretical guarantees and empirical performance improvements.
Details
Motivation: Graph Neural ODEs (GNODEs) lack uncertainty quantification capabilities, which is crucial for reliable decision-making in graph-structured data applications. Existing methods don't provide both epistemic and aleatoric uncertainty estimation with theoretical guarantees.
Method: Latent Graph Neural SDEs (LGNSDE) embed randomness through a Bayesian prior-posterior mechanism for epistemic uncertainty and Brownian motion for aleatoric uncertainty. The framework leverages existence and uniqueness of solutions to graph-based SDEs.
Result: Theoretical proof that latent space variance bounds model output variance, mathematical demonstration of robustness to input perturbations, and empirical results showing competitive performance in out-of-distribution detection, noise robustness, and active learning tasks.
Conclusion: LGNSDEs provide a theoretically grounded framework for uncertainty-aware graph representation learning with proven robustness properties and practical effectiveness across multiple benchmarks.
Abstract: We propose a novel Stochastic Differential Equation (SDE) framework to address the problem of learning uncertainty-aware representations for graph-structured data. While Graph Neural Ordinary Differential Equations (GNODEs) have shown promise in learning node representations, they lack the ability to quantify uncertainty. To address this, we introduce Latent Graph Neural Stochastic Differential Equations (LGNSDE), which enhance GNODE by embedding randomness through a Bayesian prior-posterior mechanism for epistemic uncertainty and Brownian motion for aleatoric uncertainty. By leveraging the existence and uniqueness of solutions to graph-based SDEs, we prove that the variance of the latent space bounds the variance of model outputs, thereby providing theoretically sensible guarantees for the uncertainty estimates. Furthermore, we show mathematically that LGNSDEs are robust to small perturbations in the input, maintaining stability over time. Empirical results across several benchmarks demonstrate that our framework is competitive in out-of-distribution detection, robustness to noise, and active learning, underscoring the ability of LGNSDEs to quantify uncertainty reliably. Code is available at https://github.com/Richard-Bergna/GraphNeuralSDE.
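The aleatoric half of the mechanism is straightforward to sketch: integrate a GNN-parameterised drift with Euler-Maruyama and read uncertainty off the spread of sampled trajectories. This is our simplification; the paper additionally places a Bayesian prior-posterior over the weights for epistemic uncertainty.

```python
import torch
import torch.nn as nn

class GraphDrift(nn.Module):
    """One-layer GCN-style drift: f(H) = tanh(A_hat H W)."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, H, A_hat):
        return torch.tanh(A_hat @ self.lin(H))

def integrate_sde(H0, A_hat, drift, sigma=0.1, T=1.0, steps=20):
    """Euler-Maruyama for dH = f(H) dt + sigma dW."""
    H, dt = H0, T / steps
    for _ in range(steps):
        H = H + drift(H, A_hat) * dt + sigma * (dt ** 0.5) * torch.randn_like(H)
    return H

N, d = 5, 8
A_hat = torch.eye(N)                 # placeholder for a normalised adjacency
H0, drift = torch.randn(N, d), GraphDrift(d)
samples = torch.stack([integrate_sde(H0, A_hat, drift) for _ in range(32)])
print(samples.var(0).mean())         # spread across SDE samples ~ aleatoric uncertainty
```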
[279] Input-Time Scaling
Rapheal Huang, Weilong Guo
Main category: cs.LG
TL;DR: Introduces Input-Time Scaling paradigm that refines queries using meta-knowledge from LLMs, challenging the “garbage in, garbage out” principle by showing that seemingly low-quality or irrelevant data can achieve SOTA performance with minimal data (1k examples).
Details
Motivation: To complement existing scaling methods (data scaling and inference scaling) by focusing on query refinement during input time, and to challenge conventional wisdom about data quality requirements.
Method: Utilizes meta-knowledge from LLMs to refine inputs with different strategies during both training and testing (train-test co-design), using minimally filtered datasets and even adding irrelevant information to queries.
Result: Achieved SOTA performance among 32B models: 76.7% on AIME24 and AIME25, and up to 90.0% on AIME24 and 80.0% on AIME25 with majority voting. Surprisingly found that 1k examples outperform 15k examples of similar quality.
Conclusion: Input-Time Scaling is an effective paradigm that challenges traditional data quality assumptions, demonstrates that less data can be more effective, and requires coordinated training-testing strategies for optimal performance.
Abstract: Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data & training scaling) and perform reasoning at test time (inference time scaling). In this work, we present a new scaling paradigm, Input-Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we utilize meta-knowledge from LLMs to refine inputs with different strategies. We also discover a new phenomenon, train-test co-design. It requires us to apply query strategies during training and testing as a whole. Only applying strategies on training or testing would seriously degrade the performance gained. We are also surprised to find that seemingly low data quality datasets can perform better. We can get the best performance even by adding irrelevant information to the queries, with randomly selected 1k examples from a minimally filtered dataset. These findings contradict the widely held inductive bias, “garbage in, garbage out”. Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data with similar quality (15k vs 1k) perform worse; the intuition of simply scaling dataset size should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon. 1K examples are enough to invoke high-level reasoning ability. With experiments on Qwen2.5-32B-Instruct, we are able to reach SOTA performance among 32B models on AIME24 (76.7%) and AIME25 (76.7%) pass@1. We can further achieve AIME24 (76.7%) and AIME25 (80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the result would be 90.0% on AIME24 and 80.0% on AIME25. To facilitate reproducibility and further research, we are working on open-sourcing our datasets, data pipelines, evaluation results, and checkpoints.
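Mechanically, the paradigm amounts to wrapping each query with a refinement strategy before the model sees it. A minimal sketch follows; the strategy templates and the `llm` callable are our own placeholders, and per the train-test co-design finding the same strategy should be applied during training and at test time.

```python
# Hypothetical strategy templates; the paper derives refinements from
# LLM meta-knowledge rather than fixed strings.
STRATEGIES = {
    "restate":  "Restate the problem precisely, then solve it:\n{q}",
    "plan":     "Draft a brief plan before answering:\n{q}",
    "distract": "Note: some details may be irrelevant.\n{q}\nExtra context: {noise}",
}

def refine_query(q, strategy, noise=""):
    return STRATEGIES[strategy].format(q=q, noise=noise)

def answer(llm, q, strategy="plan"):
    return llm(refine_query(q, strategy))

print(refine_query("What is 12 * 13?", "plan"))
```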
[280] Neural Force Field: Few-shot Learning of Generalized Physical Reasoning
Shiqian Li, Ruihong Shen, Yaoyu Tao, Chi Zhang, Yixin Zhu
Main category: cs.LG
TL;DR: Neural Force Field (NFF) extends Neural ODEs to learn physical dynamics through force field representations, enabling strong generalization from minimal training data by capturing core physics principles like gravity and collision.
Details
Motivation: Current AI models struggle with physical reasoning generalization, especially in out-of-distribution settings, due to inability to abstract core physical principles from observations. There's a need for representations that can efficiently learn and generalize physical dynamics from limited data.
Method: NFF framework extends Neural ODEs to learn object interactions through continuous explicit force field representations, which are integrated through ODE solvers to predict trajectories. It captures fundamental physics concepts like gravity, support, and collision.
Result: Experiments on three challenging physical reasoning tasks show NFF achieves strong generalization to unseen scenarios with only a few training examples. The physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement.
Conclusion: Incorporating physics-inspired representations like NFF can help bridge the gap between artificial and human physical reasoning capabilities by enabling efficient learning and generalization from minimal data.
Abstract: Physical reasoning is a remarkable human ability that enables rapid learning and generalization from limited experience. Current AI models, despite extensive training, still struggle to achieve similar generalization, especially in Out-of-distribution (OOD) settings. This limitation stems from their inability to abstract core physical principles from observations. A key challenge is developing representations that can efficiently learn and generalize physical dynamics from minimal data. Here we present Neural Force Field (NFF), a framework extending Neural Ordinary Differential Equation (NODE) to learn complex object interactions through force field representations, which can be efficiently integrated through an Ordinary Differential Equation (ODE) solver to predict object trajectories. Unlike existing approaches that rely on discrete latent spaces, NFF captures fundamental physical concepts such as gravity, support, and collision in continuous explicit force fields. Experiments on three challenging physical reasoning tasks demonstrate that NFF, trained with only a few examples, achieves strong generalization to unseen scenarios. This physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement. Our work suggests that incorporating physics-inspired representations into learning systems can help bridge the gap between artificial and human physical reasoning capabilities.
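A toy rendition of the idea: an MLP maps pairwise displacements to forces, which are summed per object and integrated to produce trajectories. We use explicit Euler and 2-D point masses for brevity where the paper integrates the learned field with an ODE solver.

```python
import torch
import torch.nn as nn

class PairwiseForce(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2))

    def forward(self, pos):
        """Sum learned pairwise forces acting on each 2-D object."""
        n = pos.shape[0]
        rel = pos[None, :, :] - pos[:, None, :]      # (n, n, 2) displacements
        f = self.net(rel.reshape(-1, 2)).reshape(n, n, 2)
        f = f * (1 - torch.eye(n))[:, :, None]       # zero out self-force
        return f.sum(1)

def simulate(force_field, pos, vel, dt=0.05, steps=50):
    traj = [pos]
    for _ in range(steps):
        vel = vel + force_field(traj[-1]) * dt       # unit mass: F = a
        traj.append(traj[-1] + vel * dt)
    return torch.stack(traj)

traj = simulate(PairwiseForce(), torch.randn(3, 2), torch.zeros(3, 2))
print(traj.shape)                                    # (steps + 1, objects, 2)
```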
[281] Constraint Guided Model Quantization of Neural Networks
Quinten Van Baelen, Peter Karsmakers
Main category: cs.LG
TL;DR: CGMQ is a quantization aware training method that automatically produces mixed precision neural networks satisfying predefined computational constraints without hyperparameter tuning, achieving competitive performance on MNIST and CIFAR10.
Details
Motivation: Edge computing hardware has limited resources, making it difficult to run complex neural networks. Existing quantization methods require hyperparameter tuning to meet computational constraints.
Method: Constraint Guided Model Quantization (CGMQ) uses an upper bound on computational resources to reduce parameter bit-widths during quantization aware training, eliminating the need for hyperparameter tuning.
Result: CGMQ produces mixed precision neural networks that satisfy computational constraints while achieving competitive performance compared to state-of-the-art quantization methods on MNIST and CIFAR10 datasets.
Conclusion: CGMQ provides an effective quantization approach that guarantees satisfaction of computational constraints without hyperparameter tuning, making it suitable for resource-constrained edge deployment.
Abstract: Deploying neural networks on the edge has become increasingly important as deep learning is being applied in a growing number of applications. At the edge, computing hardware typically has limited resources, which precludes running highly complex neural networks. To reduce the complexity of neural networks, a wide range of quantization methods have been proposed in recent years. This work proposes Constraint Guided Model Quantization (CGMQ), which is a quantization aware training algorithm that uses an upper bound on the computational resources and reduces the bit-widths of the parameters of the neural network. CGMQ does not require the tuning of a hyperparameter to result in a mixed precision neural network that satisfies the predefined computational cost constraint, while prior work does. It is shown on MNIST and CIFAR10 that the performance of CGMQ is competitive with state-of-the-art quantization aware training algorithms, while guaranteeing the satisfaction of an upper bound on the computational complexity defined by the computational resources of the edge hardware.
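For intuition about the cost constraint, here is a back-of-envelope greedy projection of per-layer bit-widths onto a bit budget. CGMQ enforces the constraint inside quantization-aware training rather than as a post-hoc projection, so treat this purely as an illustration of the cost model.

```python
def project_to_budget(layers, budget_bits):
    """layers: list of (name, n_params, bits). Greedily lower the bit-width
    of the largest still-reducible layer until the total cost fits the budget."""
    layers = [list(l) for l in layers]
    cost = lambda: sum(n * b for _, n, b in layers)
    while cost() > budget_bits:
        cand = [l for l in layers if l[2] > 2]       # floor at 2-bit weights
        if not cand:
            raise ValueError("budget infeasible even at 2-bit weights")
        max(cand, key=lambda l: l[1])[2] -= 1
    return [tuple(l) for l in layers]

net = [("conv1", 30_000, 8), ("conv2", 250_000, 8), ("fc", 1_000_000, 8)]
print(project_to_budget(net, budget_bits=5_000_000))
```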
[282] Auxiliary Discriminator Sequence Generative Adversarial Networks (ADSeqGAN) for Few Sample Molecule Generation
Haocheng Tang, Jing Long, Beihong Ji, Junmei Wang
Main category: cs.LG
TL;DR: ADSeqGAN is a novel GAN-based approach that integrates an auxiliary random forest classifier to improve molecular generation quality in small-sample datasets, demonstrating superior performance in generating nucleic acid binders, CNS drugs, and CB1 ligands compared to baseline models.
Details
Motivation: Traditional generative models struggle with limited training data in drug discovery, particularly for specific therapeutic targets like nucleic acid binders and CNS drugs where datasets are scarce.
Method: Integrates an auxiliary random forest classifier as an additional discriminator into the GAN framework, incorporating a pretrained generator and Wasserstein distance to enhance training stability and diversity.
Result: Superior nucleic acid binder generation, improved CNS drug generation through oversampling, and generated novel CB1 ligands with 32.8% predicted actives surpassing hit rates of specialized libraries.
Conclusion: ADSeqGAN offers a versatile framework for molecular design in data-scarce scenarios with demonstrated success across multiple therapeutic target applications.
Abstract: In this work, we introduce Auxiliary Discriminator Sequence Generative Adversarial Networks (ADSeqGAN), a novel approach for molecular generation in small-sample datasets. Traditional generative models often struggle with limited training data, particularly in drug discovery, where molecular datasets for specific therapeutic targets, such as nucleic acid binders and central nervous system (CNS) drugs, are scarce. ADSeqGAN addresses this challenge by integrating an auxiliary random forest classifier as an additional discriminator into the GAN framework, significantly improving molecular generation quality and class specificity. Our method incorporates a pretrained generator and Wasserstein distance to enhance training stability and diversity. We evaluate ADSeqGAN across three representative cases. First, on nucleic acid- and protein-targeting molecules, ADSeqGAN shows superior capability in generating nucleic acid binders compared to baseline models. Second, through oversampling, it markedly improves CNS drug generation, achieving higher yields than traditional de novo models. Third, in cannabinoid receptor type 1 (CB1) ligand design, ADSeqGAN generates novel druglike molecules, with 32.8% predicted actives surpassing hit rates of CB1-focused and general-purpose libraries when assessed by a target-specific LRIP-SF scoring function. Overall, ADSeqGAN offers a versatile framework for molecular design in data-scarce scenarios, with demonstrated applications in nucleic acid binders, CNS drugs, and CB1 ligands.
[283] Bayesian Sheaf Neural Networks
Patrick Gillespie, Layal Bou Hamdan, Ioannis Schizas, David L. Boothe, Vasileios Maroulas
Main category: cs.LG
TL;DR: Bayesian sheaf neural networks use variational learning with novel SO(n) distributions to improve robustness in heterophilic graph data representation.
Details
Motivation: Sheaf neural networks are effective for heterophilic graph data but can be overly sensitive to learned sheaf structures, requiring more robust learning approaches.
Method: Proposed Bayesian sheaf neural networks using variational learning with novel reparameterizable probability distributions on rotation group SO(n) via Cayley transform.
Result: Bayesian sheaf models achieve leading performance compared to baselines and show reduced sensitivity to hyperparameters in limited training data settings.
Conclusion: Variational Bayesian approach enhances robustness and performance of sheaf neural networks for heterophilic graph learning tasks.
Abstract: Equipping graph neural networks with a convolution operation defined in terms of a cellular sheaf offers advantages for learning expressive representations of heterophilic graph data. The most flexible approach to constructing the sheaf is to learn it as part of the network as a function of the node features. However, this leaves the network potentially overly sensitive to the learned sheaf. As a counter-measure, we propose a variational approach to learning cellular sheaves within sheaf neural networks, yielding an architecture we refer to as a Bayesian sheaf neural network. As part of this work, we define a novel family of reparameterizable probability distributions on the rotation group $SO(n)$ using the Cayley transform. We evaluate the Bayesian sheaf neural network on several graph datasets, and show that our Bayesian sheaf models achieve leading performance compared to baseline models and are less sensitive to the choice of hyperparameters under limited training data settings.
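The Cayley-transform reparameterization is concrete enough to sketch directly: Gaussian-distributed entries fill a skew-symmetric matrix A, and R = (I - A)^{-1}(I + A) lands in SO(n) while gradients flow back to the distribution parameters. The paper defines a particular distribution family; this shows only the mechanism.

```python
import torch

def sample_rotation(mu, log_sigma, n):
    """mu, log_sigma parameterise the n(n-1)/2 Gaussian entries of A."""
    theta = mu + log_sigma.exp() * torch.randn_like(mu)   # reparameterization
    A = torch.zeros(n, n)
    iu = torch.triu_indices(n, n, offset=1)
    A[iu[0], iu[1]] = theta
    A = A - A.T                                           # skew-symmetric
    I = torch.eye(n)
    return torch.linalg.solve(I - A, I + A)               # Cayley map into SO(n)

n = 3
k = n * (n - 1) // 2
mu = torch.zeros(k, requires_grad=True)
log_sigma = torch.zeros(k, requires_grad=True)
R = sample_rotation(mu, log_sigma, n)
print(torch.allclose(R @ R.T, torch.eye(n), atol=1e-5), torch.det(R))
```

Because skew-symmetric matrices have purely imaginary eigenvalues, I - A is always invertible, so every sample is a valid proper rotation.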
[284] Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
Marianna Nezhurina, Jörg Franke, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev
Main category: cs.LG
TL;DR: Open-sci-ref is a family of dense transformer models (0.13B-1.7B parameters) trained on 8 open reference datasets to establish research baselines for comparing training approaches across scales.
Details
Motivation: To provide reference points for researchers to assess the sanity and quality of alternative training approaches across different model scales and datasets, enabling standardized comparison.
Method: Trained transformer models on 8 recent open reference datasets at multiple parameter scales (up to 1.7B) and token scales (up to 1T tokens), with intermediate checkpoints for studying training dynamics.
Result: NemoTron-CC HQ consistently outperformed other reference datasets, followed by DCLM-baseline and FineWeb-Edu. The models provide scaling trends for comparing training procedures on a common compute axis.
Conclusion: The release of models, intermediate checkpoints, logs, code, and evaluations simplifies reproduction, standardizes comparison, and facilitates future research in training methodology evaluation.
Abstract: We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model (0.13B to 1.7B parameters) and token scales (up to 1T) on 8 recent open reference datasets. Evaluating the models on various standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints allow comparison and studying of the training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.
[285] Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Main category: cs.LG
TL;DR: This paper analyzes how model size, training data scale, and data source affect social biases in vision-language models like CLIP and OpenCLIP, finding that larger models/datasets don’t guarantee fairness and data source is the primary driver of bias patterns.
Details
Motivation: Vision-language models inherit social biases from training data, but the specific contributions of model size, data scale, and data source to bias patterns are not well understood; it is often assumed that bigger models/datasets are automatically fairer.
Method: Systematically compared CLIP and OpenCLIP models with identical contrastive objectives but different encoder widths and training data (400M proprietary pairs vs 400M/2B LAION pairs). Evaluated across balanced face-analysis benchmarks and tested three post-hoc debiasing strategies: Bias Prompts, Prompt Array, and SANER.
Result: Enlarging encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP. Increasing LAION corpus from 400M to 2B increases OpenCLIP bias. Substituting proprietary data with LAION improves gender fairness but increases racial skew. Debiasing strategies reduce but don’t eliminate harm, with effectiveness being source- and size-dependent.
Conclusion: Bigger models or datasets are not automatically fairer. Training data source is the key determinant of both bias patterns and mitigation efficacy. The findings challenge assumptions about scaling and fairness in VLMs.
Abstract: Vision-language models (VLMs) deliver strong zero-shot recognition but frequently inherit social biases from their training data. We systematically disentangle three design factors – model size, training-data scale, and training-data source – by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image-text corpora on which they are pre-trained (400M proprietary pairs vs. 400M/2B LAION). Across balanced face-analysis benchmarks, enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring data source as the primary driver of bias patterns. We also evaluate three post-hoc, test-time debiasing strategies – Bias Prompts, Prompt Array, and SANER. Debiasing reduces but does not eliminate harm, and its effectiveness is source- and size-dependent: Bias Prompts most effectively reduce gender skew in CLIP at smaller model sizes, whereas Prompt Array and SANER more reliably reduce racial skew in OpenCLIP; scaling LAION reconfigures which method is most fair. Taken together, these findings challenge the assumption that bigger models or datasets are automatically fairer and foreground training data source as the key determinant of both bias and mitigation efficacy. We release code and evaluation scripts to enable transparent, reproducible auditing of future VLMs.
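One simple way to quantify the skew such audits report, stated in our own formulation: the maximum deviation of per-group positive-decision rates from their mean; the paper's exact metrics may differ.

```python
import numpy as np

def group_skew(preds, groups):
    """preds: binary zero-shot decisions; groups: group id per sample."""
    rates = {g: preds[groups == g].mean() for g in np.unique(groups)}
    mean_rate = np.mean(list(rates.values()))
    return max(abs(r - mean_rate) for r in rates.values()), rates

preds = np.array([1, 0, 1, 1, 0, 0, 1, 0])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(group_skew(preds, groups))   # max deviation 0.25; rates 0.75 vs 0.25
```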
[286] A Comprehensive Survey on Imbalanced Data Learning
Xinyi Gao, Dongting Xie, Yihang Zhang, Zhengren Wang, Chong Chen, Conghui He, Hongzhi Yin, Wentao Zhang
Main category: cs.LG
TL;DR: Survey paper on machine learning with imbalanced data distributions, categorizing existing research into four approaches and providing overview of open-source tools and future challenges.
Details
Motivation: Imbalanced data distributions are prevalent in real-world data and severely hinder ML performance by biasing decision-making processes. The paper aims to deepen understanding and facilitate research on handling imbalanced data.
Method: Systematic analysis of various real-world data formats and categorization of existing research into four distinct categories: data re-balancing, feature representation, training strategy, and ensemble learning.
Result: Provides structured analysis to help researchers understand imbalance across diverse data formats, overview of relevant open-source libraries, and identification of current challenges.
Conclusion: The survey paves a clearer path toward achieving specific research goals and offers novel insights to foster future advancements in handling imbalanced data in machine learning.
Abstract: With the expansion of data availability, machine learning (ML) has achieved remarkable breakthroughs in both academia and industry. However, imbalanced data distributions are prevalent in various types of raw data and severely hinder the performance of ML by biasing the decision-making processes. To deepen the understanding of imbalanced data and facilitate the related research and applications, this survey systematically analyzes various real-world data formats and organizes existing research on different data formats into four distinct categories: data re-balancing, feature representation, training strategy, and ensemble learning. This structured analysis helps researchers comprehensively understand the pervasive nature of imbalance across diverse data formats, thereby paving a clearer path toward achieving specific research goals. We provide an overview of relevant open-source libraries, spotlight current challenges, and offer novel insights aimed at fostering future advancements in this critical area of study.
[287] When and How Does CLIP Enable Domain and Compositional Generalization?
Elias Kempf, Simon Schrodi, Max Argus, Thomas Brox
Main category: cs.LG
TL;DR: CLIP’s generalization depends on domain diversity in training data, with compositional generalization being weaker than domain generalization when training data is suboptimal.
Details
Motivation: To understand if CLIP can generalize to unseen domains (domain generalization) and unseen classes within partially seen domains (compositional generalization), and identify factors affecting such generalization.
Method: Trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure, followed by data-centric and mechanistic analyses.
Result: Domain diversity is essential for both types of generalization, but compositional generalization can be surprisingly weaker than domain generalization when training data contains suboptimal subsets of test domains.
Conclusion: Successful generalization requires learning sufficiently shared representations in intermediate layers and circuits, highlighting the importance of domain diversity in training distributions.
Abstract: The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires the learning of sufficiently shared representations in intermediate layers and circuits.
[288] Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings
Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton
Main category: cs.LG
TL;DR: TMO is a local-cloud LLM inference system that uses Three-M Offloading (Multi-modal, Multi-task, Multi-dialogue) with a reinforcement learning strategy to optimize where to process tasks (local vs cloud) for better performance, latency, and cost efficiency.
Details
Motivation: Large language models face deployment challenges: local devices struggle with computational/memory/energy constraints, while cloud deployment lacks real-time guarantees and incurs communication costs. A hybrid approach is needed to balance these trade-offs.
Method: TMO combines a lightweight local LLM for simple tasks and a large cloud LLM for complex multi-modal processing. It uses resource-constrained reinforcement learning (RCRL) to optimize inference location and data source selection for each task/dialogue.
Result: TMO significantly outperforms exploration-decision and LLM-as-Agent baselines, showing improvements in latency, cost, and response quality. The system also introduces M4A1 dataset for evaluating offloading decisions.
Conclusion: The Three-M Offloading approach with RCRL optimization effectively addresses LLM deployment challenges, providing a practical solution that balances local and cloud processing to maximize performance while respecting resource constraints.
Abstract: Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and multi-modal data sources to use for each task/dialogue, aiming to maximize the long-term reward (response quality, latency, and usage cost) while adhering to resource constraints. We also contribute M4A1, a new dataset we curated that contains reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO compared to several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.
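To make the offloading idea concrete, here is a minimal, hypothetical sketch (not the paper's RCRL policy, which is learned rather than rule-based): each turn picks local or cloud inference by scoring a quality/latency/cost trade-off under a usage budget. All names and numbers below are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): choose local vs. cloud inference
# by maximizing reward = quality - latency/cost penalties, subject to a
# remaining usage budget. All numbers are hypothetical.

def offload_decision(task, budget_left, w_quality=1.0, w_latency=0.3, w_cost=0.2):
    """Return 'local' or 'cloud' for one task/dialogue turn."""
    options = {
        "local": {"quality": task["local_quality"], "latency": 0.1, "cost": 0.0},
        "cloud": {"quality": task["cloud_quality"], "latency": 0.8, "cost": 1.0},
    }
    best, best_reward = "local", float("-inf")
    for name, o in options.items():
        if o["cost"] > budget_left:  # respect the resource constraint
            continue
        r = w_quality * o["quality"] - w_latency * o["latency"] - w_cost * o["cost"]
        if r > best_reward:
            best, best_reward = name, r
    return best

# A complex multi-modal task: the cloud LLM's quality edge outweighs its cost.
print(offload_decision({"local_quality": 0.4, "cloud_quality": 0.9}, budget_left=5.0))
```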
[289] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
Main category: cs.LG
TL;DR: AReaL is an asynchronous RL system for LLMs that decouples generation from training, achieving 2.77× speedup with same GPU count and maintaining performance.
Details
Motivation: Synchronous RL systems for LLMs suffer from GPU underutilization, as generation must wait for the longest output in the batch to complete before model updates.
Method: Fully asynchronous system with continuous rollout generation, staleness control, and a staleness-enhanced PPO variant to handle outdated training samples.
Result: Achieves up to 2.77× training speedup compared to synchronous systems with same GPU resources, with matched or improved final performance on math and code reasoning benchmarks.
Conclusion: AReaL demonstrates that asynchronous RL training with proper staleness management can significantly improve system efficiency while maintaining training stability and performance.
Abstract: Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77$\times$ training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
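A minimal sketch of the two stabilization ingredients the abstract names, under assumed data layouts: a version-gap filter that bounds staleness, and a PPO-style loss whose importance ratio is taken against the (possibly stale) behavior policy. This is an illustration, not AReaL's implementation.

```python
import torch

MAX_STALENESS = 4  # hypothetical bound on the policy-version gap

def filter_stale(batch, current_version):
    # Drop rollouts generated by a policy that is too many versions old.
    return [s for s in batch if current_version - s["version"] <= MAX_STALENESS]

def ppo_loss(new_logp, behavior_logp, advantage, clip_eps=0.2):
    ratio = torch.exp(new_logp - behavior_logp)  # corrects for stale samples
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

batch = [{"version": 10, "new_logp": -1.0, "behavior_logp": -1.2, "adv": 0.5},
         {"version": 3,  "new_logp": -0.9, "behavior_logp": -2.0, "adv": 1.0}]
kept = filter_stale(batch, current_version=11)  # the version-3 sample is dropped
loss = ppo_loss(torch.tensor([s["new_logp"] for s in kept]),
                torch.tensor([s["behavior_logp"] for s in kept]),
                torch.tensor([s["adv"] for s in kept]))
print(len(kept), loss.item())
```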
[290] A Unified Framework for Diffusion Bridge Problems: Flow Matching and Schrödinger Matching into One
Minyoung Kim
Main category: cs.LG
TL;DR: This paper provides a unified framework that subsumes various bridge problem algorithms including Flow Matching, optimal transport FM, Schrödinger bridge FM, and deep Schrödinger bridge matching into one general perspective.
Details
Motivation: The bridge problem (finding an SDE/ODE connecting two distributions) has enormous applications in generative modeling, but existing algorithms like flow matching and iterative fitting approaches seem unrelated and are confined to specific problem types.
Method: The authors propose a novel unified framework that can instantiate multiple bridge algorithms as special cases, providing concise reviews of existing methods with technical details.
Result: The unified framework successfully subsumes the Flow Matching, optimal transport FM, Schrödinger bridge FM, and deep Schrödinger bridge matching algorithms as its special cases.
Conclusion: This unified perspective provides a more general and flexible view of bridge problems, which will help researchers develop new bridge algorithms across different application fields.
Abstract: The bridge problem is to find an SDE (or sometimes an ODE) that bridges two given distributions. The application areas of the bridge problem are enormous, among which the recent generative modeling (e.g., conditional or unconditional image generation) is the most popular. Also, the famous Schrödinger bridge problem, a widely known problem for a century, is a special instance of the bridge problem. The two most popular algorithms to tackle bridge problems in the deep learning era are: (conditional) flow matching and iterative fitting algorithms, where the former is confined to ODE solutions, and the latter is specific to the Schrödinger bridge problem. The main contribution of this article is two-fold: i) We provide concise reviews of these algorithms with technical details to some extent; ii) We propose a novel unified perspective and framework that subsumes these seemingly unrelated algorithms (and their variants) into one. In particular, we show that our unified framework can instantiate the Flow Matching (FM) algorithm, the (mini-batch) optimal transport FM algorithm, the (mini-batch) Schrödinger bridge FM algorithm, and the deep Schrödinger bridge matching (DSBM) algorithm as its special cases. We believe that this unified framework will be useful for viewing the bridge problems in a more general and flexible perspective, and in turn can help researchers and practitioners to develop new bridge algorithms in their fields.
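As a concrete instance of the simplest special case, here is a minimal (conditional) Flow Matching sketch: sample a point on the straight-line path between the two distributions and regress a network onto the path's velocity. Shapes and the toy target are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Small velocity-field network: input (x_t, t), output a 2-D velocity.
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x0 = torch.randn(128, 2)          # source distribution sample
    x1 = torch.randn(128, 2) + 3.0    # toy target distribution sample
    t = torch.rand(128, 1)
    xt = (1 - t) * x0 + t * x1        # point on the linear bridge at time t
    v_target = x1 - x0                # constant velocity of the linear path
    v_pred = model(torch.cat([xt, t], dim=1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```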
[291] HiLight: A Hierarchical Reinforcement Learning Framework with Global Adversarial Guidance for Large-Scale Traffic Signal Control
Yaqiao Zhu, Hongkai Wen, Geyong Min, Man Luo
Main category: cs.LG
TL;DR: HiLight is a hierarchical RL framework with adversarial training for large-scale traffic signal control, using global meta-policy and local sub-policies to achieve better coordination and scalability.
Details
Motivation: Existing RL methods for traffic signal control struggle with scalability in large networks while maintaining global coordination - centralized approaches have scalability issues and decentralized methods lack unified objectives.
Method: Hierarchical RL framework with a Meta-Policy (Transformer-LSTM) that partitions the network and generates sub-goals, and a Sub-Policy that controls intersections. Uses adversarial training where the Meta-Policy creates challenging sub-goals and the Sub-Policy learns to surpass them.
Result: HiLight shows significant advantages in large-scale scenarios and remains competitive across standard benchmarks of varying sizes, performing well under diverse traffic conditions including peak transitions, adverse weather, and holiday surges.
Conclusion: The hierarchical framework with adversarial guidance effectively addresses scalability and coordination challenges in large-scale traffic signal control, demonstrating robust performance across various network sizes and traffic conditions.
Abstract: Efficient traffic signal control (TSC) is essential for mitigating urban congestion, yet existing reinforcement learning (RL) methods face challenges in scaling to large networks while maintaining global coordination. Centralized RL suffers from scalability issues, while decentralized approaches often lack unified objectives, resulting in limited network-level efficiency. In this paper, we propose HiLight, a hierarchical reinforcement learning framework with global adversarial guidance for large-scale TSC. HiLight consists of a high-level Meta-Policy, which partitions the traffic network into subregions and generates sub-goals using a Transformer-LSTM architecture, and a low-level Sub-Policy, which controls individual intersections with global awareness. To improve the alignment between global planning and local execution, we introduce an adversarial training mechanism, where the Meta-Policy generates challenging yet informative sub-goals, and the Sub-Policy learns to surpass these targets, leading to more effective coordination. We evaluate HiLight across both synthetic and real-world benchmarks, and additionally construct a large-scale Manhattan network with diverse traffic conditions, including peak transitions, adverse weather, and holiday surges. Experimental results show that HiLight exhibits significant advantages in large-scale scenarios and remains competitive across standard benchmarks of varying sizes.
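A toy sketch of the adversarial sub-goal mechanic described above, with entirely hypothetical reward shapes (the paper does not specify these forms): the sub-policy is paid a bonus for surpassing the meta-generated target, while the meta-policy is scored for keeping targets challenging yet informative.

```python
def sub_policy_reward(base_reward, achieved_throughput, sub_goal, bonus=1.0):
    # Bonus only for exceeding the meta-policy's target (assumed shaping).
    surplus = achieved_throughput - sub_goal
    return base_reward + (bonus * surplus if surplus > 0 else 0.0)

def meta_policy_reward(achieved_throughput, sub_goal):
    # Adversarial but informative: reward goals near the edge of what the
    # sub-policy can achieve (hypothetical scoring rule).
    return -abs(achieved_throughput - sub_goal)

print(sub_policy_reward(2.0, achieved_throughput=110.0, sub_goal=100.0))  # 12.0
print(meta_policy_reward(110.0, 100.0))                                   # -10.0
```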
[292] Learning Value of Information towards Joint Communication and Control in 6G V2X
Lei Lei, Kan Zheng, Xuemin Shen
Main category: cs.LG
TL;DR: The paper introduces Sequential Stochastic Decision Process (SSDP) models to define and assess Value of Information (VoI) for optimizing communication systems in Connected Autonomous Vehicles, bridging vehicle control and V2X communication.
Details
Motivation: As C-V2X evolves towards 6G networks, CAVs require enhanced decision-making under uncertainty. VoI serves as a crucial bridge between vehicle control and V2X communication, but current research remains fragmented.
Method: Proposes SSDP models that generalize MDPs, explicitly representing information that enhances decision-making. Develops a systematic VoI modeling framework based on MDP, RL and Optimal Control theories, with VoI categories and estimation methods.
Result: SSDP models with VoI-associated reward functions enable optimization of “When”, “What”, and “How” to communicate. Demonstrated through a vehicle-following control problem, showing potential for joint optimization of control and communication decisions.
Conclusion: The SSDP framework and VoI modeling approach provide a structured methodology for optimizing communication systems in CAVs, with significant potential for broader networked control systems applications.
Abstract: As Cellular Vehicle-to-Everything (C-V2X) evolves towards future sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are emerging to become a key application. Leveraging data-driven Machine Learning (ML), especially Deep Reinforcement Learning (DRL), is expected to significantly enhance CAV decision-making in both vehicle control and V2X communication under uncertainty. These two decision-making processes are closely intertwined, with the value of information (VoI) acting as a crucial bridge between them. In this paper, we introduce Sequential Stochastic Decision Process (SSDP) models to define and assess VoI, demonstrating their application in optimizing communication systems for CAVs. Specifically, we formally define the SSDP model and demonstrate that the MDP model is a special case of it. The SSDP model offers a key advantage by explicitly representing the set of information that can enhance decision-making when available. Furthermore, as current research on VoI remains fragmented, we propose a systematic VoI modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal Control theories. We define different categories of VoI and discuss their corresponding estimation methods. Finally, we present a structured approach to leverage the various VoI metrics for optimizing the "When", "What", and "How" to communicate problems. For this purpose, SSDP models are formulated with VoI-associated reward functions derived from VoI-based optimization objectives. While we use a simple vehicle-following control problem to illustrate the proposed methodology, it holds significant potential to facilitate the joint optimization of stochastic, sequential control and communication decisions in a wide range of networked control systems.
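A worked numerical example of one standard VoI notion consistent with the abstract's framing (the states, probabilities, and rewards below are invented): VoI is the gain in expected reward from deciding with an observation versus deciding from the prior alone.

```python
import numpy as np

prior = np.array([0.5, 0.5])           # P(lead vehicle brakes / cruises)
rewards = np.array([[-10.0, 1.0],      # action "brake":      (brakes, cruises)
                    [-100.0, 2.0]])    # action "keep speed": (brakes, cruises)

# Without the message: pick the action with the best prior-expected reward.
v_uninformed = max(rewards @ prior)            # -4.5 (always brake)

# With a perfect observation of the state: act optimally in each state.
v_informed = float(prior @ rewards.max(axis=0))  # -4.0

print("VoI =", v_informed - v_uninformed)      # 0.5: expected gain from communicating
```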
[293] Multivariate Long-term Time Series Forecasting with Fourier Neural Filter
Chenheng Xu, Dan Wu, Yixin Zhu, Ying Nian Wu
Main category: cs.LG
TL;DR: FNF backbone with DBD architecture achieves state-of-the-art multivariate time series forecasting by unifying local time-domain and global frequency-domain processing, outperforming existing methods across 11 benchmark datasets without auxiliary techniques.
Details
Motivation: Current approaches repurpose NLP/CV backbones like Transformers that fail to address unique time series properties (e.g., periodicity), lacking dedicated temporal-specific inductive biases for spatio-temporal modeling.
Method: Introduces FNF as the backbone that unifies local time-domain and global frequency-domain information processing, and the DBD architecture that provides superior gradient flow and representation capacity for spatio-temporal modeling.
Result: Achieves state-of-the-art performance across 11 public benchmark datasets spanning energy, meteorology, transportation, environment, and nature domains with consistent hyperparameter settings.
Conclusion: Properly designed neural architectures can capture inherent time series properties without auxiliary techniques, potentially transforming time series modeling in scientific and industrial applications.
Abstract: Multivariate long-term time series forecasting has been suffering from the challenge of capturing both temporal dependencies within variables and spatial correlations across variables simultaneously. Current approaches predominantly repurpose backbones from natural language processing or computer vision (e.g., Transformers), which fail to adequately address the unique properties of time series (e.g., periodicity). The research community lacks a dedicated backbone with temporal-specific inductive biases, instead relying on domain-agnostic backbones supplemented with auxiliary techniques (e.g., signal decomposition). We introduce FNF as the backbone and DBD as the architecture to provide excellent learning capabilities and optimal learning pathways for spatio-temporal modeling, respectively. Our theoretical analysis proves that FNF unifies local time-domain and global frequency-domain information processing within a single backbone that extends naturally to spatial modeling, while information bottleneck theory demonstrates that DBD provides superior gradient flow and representation capacity compared to existing unified or sequential architectures. Our empirical evaluation across 11 public benchmark datasets spanning five domains (energy, meteorology, transportation, environment, and nature) confirms state-of-the-art performance with consistent hyperparameter settings. Notably, our approach achieves these results without any auxiliary techniques, suggesting that properly designed neural architectures can capture the inherent properties of time series, potentially transforming time series modeling in scientific and industrial applications.
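A sketch of what a layer unifying local time-domain and global frequency-domain processing could look like; this is an assumed form for illustration, not the paper's FNF definition.

```python
import torch
import torch.nn as nn

class FourierFilter(nn.Module):
    def __init__(self, seq_len, channels):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # Learned complex-valued global filter, one weight per frequency bin.
        self.freq_weight = nn.Parameter(torch.randn(channels, n_freq, 2) * 0.02)
        # Local time-domain path: a small causal-agnostic convolution.
        self.local_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (batch, channels, seq_len)
        spec = torch.fft.rfft(x, dim=-1)        # global frequency view
        w = torch.view_as_complex(self.freq_weight)
        global_part = torch.fft.irfft(spec * w, n=x.shape[-1], dim=-1)
        return global_part + self.local_conv(x)  # fuse global and local paths

y = FourierFilter(seq_len=96, channels=8)(torch.randn(4, 8, 96))
print(y.shape)  # torch.Size([4, 8, 96])
```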
[294] A Topic Modeling Analysis of Stigma Dimensions, Social, and Related Behavioral Circumstances in Clinical Notes Among Patients with HIV
Ziyi Chen, Yiyang Liu, Mattia Prosperi, Krishna Vaddiparti, Robert L Cook, Jiang Bian, Yi Guo, Yonghui Wu
Main category: cs.LG
TL;DR: Used NLP and topic modeling on EHR notes to identify HIV-related stigma themes and social contexts in 9140 patients, revealing topics like mental health stigma and treatment refusal across different demographics.
Details
Motivation: To overcome limitations of traditional questionnaires and enable scalable assessment of HIV-related stigma and associated social/behavioral circumstances from electronic health records.
Method: Identified a PLWH cohort from EHR data, used Latent Dirichlet Allocation topic modeling with iterative keyword expansion (91 stigma keywords), applied three filtering strategies, and conducted word frequency and subgroup analysis across age and sex demographics.
Result: Identified diverse stigma themes including “Mental Health Concern, Stigma”, “Treatment Refusal, Isolation”, and “Substance Abuse” from 2.9 million clinical notes. Topic variation analysis revealed substantial differences across age subgroups.
Conclusion: NLP methods applied to EHR clinical notes provide scalable, time-efficient assessment of HIV-related stigma, offering actionable insights to improve patient care and HIV-care outcomes.
Abstract: Objective: To characterize stigma dimensions, social, and related behavioral circumstances in people living with HIV(PLWHs) seeking care, using NLP methods applied to a large collection of EHR clinical notes from a large integrated health system in the southeast United States. Methods: We identified a cohort of PLWHs from the UF Health IDR and performed topic modeling analysis using Latent Dirichlet Allocation to uncover stigma-related dimensions and related social and behavioral contexts. Domain experts created a seed list of HIV-related stigma keywords, then applied a snowball strategy to review notes for additional terms until saturation was reached iteratively. To identify more target topics, we tested three keyword-based filtering strategies. The detected topics were evaluated using three widely used metrics and manually reviewed by specialists. In addition, we conducted word frequency analysis and topic variation analysis among subgroups to examine differences across age and sex-specific demographics. Results: We identified 9140 PLWHs at UF Health and collected 2.9 million clinical notes. Through the iterative keyword approach, we generated a list of 91 keywords associated with HIV-related stigma. Topic modeling on sentences containing at least one keyword uncovered a wide range of topic themes, such as “Mental Health Concern, Stigma”, “Treatment Refusal, Isolation”, and “Substance Abuse”. Topic variation analysis across age subgroups revealed substantial differences. Conclusion: Extracting and understanding the HIV-related stigma and associated social and behavioral circumstances from EHR clinical notes enables scalable, time-efficient assessment and overcoming the limitations of traditional questionnaires. Findings from this research provide actionable insights to inform patient care and interventions to improve HIV-care outcomes.
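A minimal sketch of the keyword-filter-then-LDA step using scikit-learn on a toy corpus (the study used 2.9 million clinical notes and 91 curated keywords; everything below is a stand-in).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

keywords = {"stigma", "isolation", "refusal"}  # toy seed list
sentences = [
    "Patient reports stigma around disclosing status.",
    "Notes treatment refusal and social isolation.",
    "Routine follow-up, labs within normal limits.",
]
# Keep only sentences containing at least one stigma keyword.
filtered = [s for s in sentences
            if keywords & set(s.lower().replace(".", "").split())]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(filtered)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in comp.argsort()[-3:]])
```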
[295] Atherosclerosis through Hierarchical Explainable Neural Network Analysis
Irsyad Adam, Steven Swee, Erika Yilin, Ethan Ji, William Speier, Dean Wang, Alex Bui, Wei Wang, Karol Watson, Peipei Ping
Main category: cs.LG
TL;DR: ATHENA is a hierarchical graph neural network that integrates clinical cohort features and individual molecular data to improve subclinical atherosclerosis classification by 13% AUC and 20% F1 score.
Details
Motivation: Current graph-based disease classification methods lack consistency in handling cohort-wide clinical features and fail to integrate shared pathogenic interdependencies among patients, limiting understanding of atherosclerotic phenotypes.
Method: Developed the ATHENA framework, which constructs a hierarchical network representation through integrated modality learning, optimizing patient-specific molecular fingerprints while enforcing consistency with cohort-wide patterns, using a dataset of 391 patients.
Result: Significantly boosted classification performance by up to 13% in AUC and 20% in F1 score compared to various baselines, enabling mechanistically-informed patient subtype discovery through explainable AI-driven subnetwork clustering.
Conclusion: ATHENA’s novel integration framework strengthens personalized intervention strategies and improves prediction of atherosclerotic disease progression and management of clinical outcomes.
Abstract: In this work, we study the problem pertaining to personalized classification of subclinical atherosclerosis by developing a hierarchical graph neural network framework to leverage two characteristic modalities of a patient: clinical features within the context of the cohort, and molecular data unique to individual patients. Current graph-based methods for disease classification detect patient-specific molecular fingerprints, but lack consistency and comprehension regarding cohort-wide features, which are an essential requirement for understanding pathogenic phenotypes across diverse atherosclerotic trajectories. Furthermore, understanding patient subtypes often considers clinical feature similarity in isolation, without integration of shared pathogenic interdependencies among patients. To address these challenges, we introduce ATHENA: Atherosclerosis Through Hierarchical Explainable Neural Network Analysis, which constructs a novel hierarchical network representation through integrated modality learning; subsequently, it optimizes learned patient-specific molecular fingerprints that reflect individual omics data, enforcing consistency with cohort-wide patterns. With a primary clinical dataset of 391 patients, we demonstrate that this heterogeneous alignment of clinical features with molecular interaction patterns has significantly boosted subclinical atherosclerosis classification performance across various baselines by up to 13% in area under the receiver operating curve (AUC) and 20% in F1 score. Taken together, ATHENA enables mechanistically-informed patient subtype discovery through explainable AI (XAI)-driven subnetwork clustering; this novel integration framework strengthens personalized intervention strategies, thereby improving the prediction of atherosclerotic disease progression and management of their clinical actionable outcomes.
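A schematic of the two-modality alignment idea with assumed encoders, shapes, and loss weights (ATHENA's actual hierarchical graph construction is more involved): combine cohort-context clinical features with patient molecular fingerprints and penalize drift from cohort-wide patterns.

```python
import torch
import torch.nn as nn

clinical_enc = nn.Linear(16, 32)   # cohort-context clinical features (assumed dims)
molecular_enc = nn.Linear(64, 32)  # pooled patient molecular fingerprint
classifier = nn.Linear(64, 2)      # subclinical atherosclerosis yes/no

def forward_loss(clin, mol, labels, cohort_centroid, lam=0.1):
    z_c, z_m = clinical_enc(clin), molecular_enc(mol)
    logits = classifier(torch.cat([z_c, z_m], dim=-1))
    task_loss = nn.functional.cross_entropy(logits, labels)
    # Consistency term: keep patient fingerprints near cohort-wide patterns.
    consistency = ((z_m - cohort_centroid) ** 2).mean()
    return task_loss + lam * consistency

loss = forward_loss(torch.randn(8, 16), torch.randn(8, 64),
                    torch.randint(0, 2, (8,)), torch.zeros(32))
print(loss.item())
```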
[296] FedFitTech: A Baseline in Federated Learning for Fitness Tracking
Zeyneddin Oz, Shreyas Korde, Marius Bock, Kristof Van Laerhoven
Main category: cs.LG
TL;DR: FedFitTech baseline framework for federated learning in fitness technology that addresses privacy concerns while reducing communication overhead by 13% with minimal performance impact.
Details
Motivation: Traditional centralized learning approaches for fitness activity detection face privacy concerns, regulatory restrictions, and communication inefficiencies. Federated Learning offers decentralized training but presents challenges like data imbalance and personalization trade-offs in FitTech applications.
Method: Developed the FedFitTech baseline under the Flower framework with a client-side early stopping strategy. The system enables wearable devices to optimize trade-offs between capturing common activities and preserving individual nuances through decentralized model training.
Result: Reduces overall redundant communications by 13% while maintaining recognition performance at a negligible cost of about 1%, enhancing the scalability and efficiency of privacy-aware fitness tracking applications.
Conclusion: FedFitTech creates foundation for new research opportunities in FitTech, available as open source. Successfully addresses privacy concerns while improving communication efficiency in federated learning for wearable fitness applications.
Abstract: The rapid evolution of sensors and resource-efficient machine learning models has spurred the widespread adoption of wearable fitness tracking devices. Equipped with inertial sensors, such devices can continuously capture physical movements for fitness technology (FitTech), enabling applications from sports optimization to preventive healthcare. Traditional Centralized Learning approaches to detect fitness activities struggle with data privacy concerns, regulatory restrictions, and communication inefficiencies. In contrast, Federated Learning (FL) enables a decentralized model training by communicating model updates rather than potentially private wearable sensor data. Applying FL to FitTech presents unique challenges, such as data imbalance, lack of labeled data, heterogeneous user activities, and trade-offs between personalization and generalization. To simplify research on FitTech in FL, we present the FedFitTech baseline, under the Flower framework, which is publicly available and widely used by both industry and academic researchers. Additionally, to illustrate its usage, this paper presents a case study that implements a system based on the FedFitTech baseline, incorporating a client-side early stopping strategy and comparing the results. For instance, this system allows wearable devices to optimize the trade-off between capturing common fitness activities and preserving individuals’ nuances, thereby enhancing both the scalability and efficiency of privacy-aware fitness tracking applications. The results show that this reduces the overall redundant communications by 13%, while maintaining the overall recognition performance at a negligible recognition cost by 1%. Thus, the FedFitTech baseline creates a foundation for a wide range of new research and development opportunities in FitTech, and it is available as open source at: https://github.com/shreyaskorde16/FedFitTech
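A framework-agnostic sketch of the client-side early-stopping idea from the case study (the real baseline is built on Flower; its API is omitted here, and the patience rule below is an assumption): a client stops contributing updates once its local validation loss plateaus, saving communication rounds.

```python
class EarlyStopClient:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.stale_rounds = 0
        self.active = True              # whether we still upload model updates

    def after_local_round(self, val_loss):
        if val_loss < self.best - 1e-4:
            self.best, self.stale_rounds = val_loss, 0
        else:
            self.stale_rounds += 1
            if self.stale_rounds >= self.patience:
                self.active = False     # skip future uploads; keep local model
        return self.active

client = EarlyStopClient()
for loss in [0.9, 0.7, 0.7, 0.7, 0.7]:
    print(client.after_local_round(loss))   # True, True, True, True, False
```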
[297] Leveraging Data Augmentation and Siamese Learning for Predictive Process Monitoring
Sjoerd van Straten, Alessandro Padella, Marwan Hassani
Main category: cs.LG
TL;DR: SiamSA-PPM is a self-supervised learning framework that combines Siamese learning with statistical augmentation to address data scarcity in predictive process monitoring, achieving state-of-the-art performance.
Details
Motivation: Deep learning approaches for predictive process monitoring are limited by the low variability and small size of real-world event logs, creating a need for effective data enrichment methods.
Method: Uses three novel statistically grounded transformation methods that leverage control-flow semantics and frequent behavioral patterns to generate realistic trace variants, combined with Siamese learning for self-supervised representation learning.
Result: Achieves competitive or superior performance compared to state-of-the-art methods in both next activity and final outcome prediction tasks, with statistical augmentation significantly outperforming random transformations.
Conclusion: SiamSA-PPM represents a promising direction for training data enrichment in process prediction, effectively addressing data scarcity issues through semantically valid augmentation.
Abstract: Predictive Process Monitoring (PPM) enables forecasting future events or outcomes of ongoing business process instances based on event logs. However, deep learning PPM approaches are often limited by the low variability and small size of real-world event logs. To address this, we introduce SiamSA-PPM, a novel self-supervised learning framework that combines Siamese learning with Statistical Augmentation for Predictive Process Monitoring. It employs three novel statistically grounded transformation methods that leverage control-flow semantics and frequent behavioral patterns to generate realistic, semantically valid new trace variants. These augmented views are used within a Siamese learning setup to learn generalizable representations of process prefixes without the need for labeled supervision. Extensive experiments on real-life event logs demonstrate that SiamSA-PPM achieves competitive or superior performance compared to the SOTA in both next activity and final outcome prediction tasks. Our results further show that statistical augmentation significantly outperforms random transformations and improves variability in the data, highlighting SiamSA-PPM as a promising direction for training data enrichment in process prediction.
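A minimal sketch of the Siamese setup on augmented trace pairs; the adjacent-activity swap below is a simplified stand-in for the paper's statistically grounded, control-flow-aware transformations.

```python
import random
import torch
import torch.nn as nn

def augment(trace):
    # Toy augmentation: swap two adjacent activities in the trace.
    t = list(trace)
    if len(t) > 2:
        i = random.randrange(len(t) - 1)
        t[i], t[i + 1] = t[i + 1], t[i]
    return t

vocab = {"register": 0, "check": 1, "approve": 2, "pay": 3}
embed = nn.EmbeddingBag(len(vocab), 32)   # mean-pools activity embeddings

def encode(trace):
    idx = torch.tensor([[vocab[a] for a in trace]])
    return embed(idx)

trace = ["register", "check", "approve", "pay"]
z1, z2 = encode(augment(trace)), encode(augment(trace))
# Pull the two views of the same process prefix together.
loss = 1 - nn.functional.cosine_similarity(z1, z2).mean()
print(loss.item())
```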
[298] EB-gMCR: Energy-Based Generative Modeling for Signal Unmixing and Multivariate Curve Resolution
Yu-Tang Chang, Shih-Fang Chen
Main category: cs.LG
TL;DR: EB-gMCR reformulates multivariate curve resolution as a data generative process with an energy-based solver that automatically discovers the smallest component set and their concentrations, outperforming traditional matrix factorization approaches in scalability and accuracy.
Details
Motivation: Classical MCR approaches based on matrix factorization require user-specified component numbers (often unknown) and face scalability challenges with increasing data or component counts.
Method: Reformulates MCR as a data generative process (gMCR) and introduces an Energy-Based solver (EB-gMCR) that automatically discovers the smallest component set and their concentrations for faithful signal reconstruction.
Result: On synthetic benchmarks with up to 256 components, EB-gMCR achieves high reconstruction fidelity and recovers the component count to within 5% at 20 dB noise and near-exactly at 30 dB. On public spectral datasets, it identifies the correct component count and improves separation over MF-based approaches.
Conclusion: EB-gMCR is a general solver for fixed-pattern signal unmixing that can incorporate domain priors as plug-in modules, enabling adaptation to new instruments/domains without altering the core selection learning process.
Abstract: Signal unmixing analysis decomposes data into basic patterns and is widely applied in chemical and biological research. Multivariate curve resolution (MCR), a branch of signal unmixing, separates mixed signals into components (base patterns) and their concentrations (intensity), playing a key role in understanding composition. Classical MCR is typically framed as matrix factorization (MF) and requires a user-specified number of components, usually unknown in real data. Once data or component number increases, the scalability of these MCR approaches face significant challenges. This study reformulates MCR as a data generative process (gMCR), and introduces an Energy-Based solver, EB-gMCR, that automatically discovers the smallest component set and their concentrations for reconstructing the mixed signals faithfully. On synthetic benchmarks with up to 256 components, EB-gMCR attains high reconstruction fidelity and recovers the component count within 5% at 20dB noise and near-exact at 30dB. On two public spectral datasets, it identifies the correct component count and improves component separation over MF-based MCR approaches (NMF variants, ICA, MCR-ALS). EB-gMCR is a general solver for fixed-pattern signal unmixing (components remain invariant across mixtures). Domain priors (non-negativity, nonlinear mixing) enter as plug-in modules, enabling adaptation to new instruments or domains without altering the core selection learning step. The source code is available at https://github.com/b05611038/ebgmcr_solver.
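A sketch of the component-selection idea under assumed functional forms (not the paper's energy-based solver): over-provision a component pool, reconstruct mixtures through nonnegative concentrations and per-component gates, and let a sparsity penalty switch off unneeded components.

```python
import torch

torch.manual_seed(0)
X = torch.rand(100, 50)                     # mixed signals (samples x channels)
K = 16                                      # over-provisioned component pool
S = torch.rand(K, 50, requires_grad=True)   # component spectra
C = torch.rand(100, K, requires_grad=True)  # concentrations
g = torch.zeros(K, requires_grad=True)      # selection gates (via sigmoid)

opt = torch.optim.Adam([S, C, g], lr=0.05)
for _ in range(300):
    gates = torch.sigmoid(g)
    recon = (C.clamp(min=0) * gates) @ S.clamp(min=0)       # nonnegative mixing
    loss = ((X - recon) ** 2).mean() + 0.01 * gates.sum()   # fit + sparsity
    opt.zero_grad(); loss.backward(); opt.step()

print("active components:", int((torch.sigmoid(g) > 0.5).sum()))
```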
[299] A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design
Haydn Thomas Jones, Natalie Maus, Josh Magnus Ludan, Maggie Ziyu Huan, Jiaming Liang, Marcelo Der Torossian Torres, Jiatao Liang, Zachary Ives, Yoseph Barash, Cesar de la Fuente-Nunez, Jacob R. Gardner, Mark Yatskar
Main category: cs.LG
TL;DR: Medex is a large dataset of 32.3 million therapeutic compound priors extracted from literature using LLM pipelines, which enables training smaller models that outperform much larger models on therapeutic design tasks and produces safer molecule proposals.
Details
Motivation: AI-driven discovery often violates implicit constraints due to a lack of experimental priors, with over 60% of proposed molecules having high mutagenic probability. There's a need for incorporating real-world experimental knowledge into design models.
Method: Constructed the Medex dataset using LLM pipelines to extract therapeutic compound information from literature, creating natural language facts paired with entity representations (SMILES/refseq IDs). Trained LLM, CLIP, and LLava architectures to reason jointly about text and design targets.
Result: Models with 15M parameters pretrained on Medex outperform 2B TxGemma on TDC regression/classification tasks and perform comparably to 9B models. Medex-constrained optimization produces safer molecules in GuacaMol while maintaining effectiveness.
Conclusion: Medex provides effective experimental priors for therapeutic design, enabling smaller models to achieve superior performance and safer molecule proposals. The dataset will be expanded as literature grows.
Abstract: AI-driven discovery can greatly reduce design time and enhance new therapeutics’ effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60% of molecules proposed had high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e. SMILES or refseq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLava architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at https://huggingface.co/datasets/medexanon/Medex, and will provide expanded versions as available literature grows.
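A hypothetical loading sketch: the Hugging Face repository id comes from the abstract, but the split and record field names below are assumptions.

```python
from datasets import load_dataset

# Repository id from the abstract; split name is an assumption.
ds = load_dataset("medexanon/Medex", split="train")
example = ds[0]
# Expected shape of a record: a natural-language fact paired with an entity
# representation such as a SMILES string or a refseq ID (field names assumed).
print(example)
```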
[300] DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy
Frederik L. Dennig, Daniel A. Keim
Main category: cs.LG
TL;DR: DE-VAE is an uncertainty-aware variational autoencoder that uses differential entropy to create parametric and invertible 2D projections, addressing poor performance with out-of-distribution samples while maintaining projection accuracy comparable to other AE methods.
Details
Motivation: Existing autoencoder methods perform poorly when dealing with out-of-distribution samples in data or embedding space, limiting their effectiveness for creating reliable parametric and invertible projections.
Method: DE-VAE uses differential entropy in a variational autoencoder framework to learn both a mapping from the original space to 2D space and an inverse mapping back, trained with a fixed projection approach.
Result: Quantitative and qualitative evaluations on four datasets show DE-VAE achieves comparable accuracy to other AE-based approaches while enabling embedding uncertainty analysis.
Conclusion: DE-VAE successfully creates parametric and invertible projections with improved handling of out-of-distribution samples and provides uncertainty analysis capabilities, making it a valuable tool for multidimensional data projection.
Abstract: Recently, autoencoders (AEs) have gained interest for creating parametric and invertible projections of multidimensional data. Parametric projections make it possible to embed new, unseen samples without recalculating the entire projection, while invertible projections allow the synthesis of new data instances. However, existing methods perform poorly when dealing with out-of-distribution samples in either the data or embedding space. Thus, we propose DE-VAE, an uncertainty-aware variational AE using differential entropy (DE) to improve the learned parametric and invertible projections. Given a fixed projection, we train DE-VAE to learn a mapping into 2D space and an inverse mapping back to the original space. We conduct quantitative and qualitative evaluations on four well-known datasets, using UMAP and t-SNE as baseline projection methods. Our findings show that DE-VAE can create parametric and inverse projections with comparable accuracy to other current AE-based approaches while enabling the analysis of embedding uncertainty.
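The differential-entropy ingredient has a simple closed form for the diagonal-Gaussian encoders typical of VAEs: for q(z|x) = N(mu, diag(sigma^2)), the entropy is 0.5 * sum(log(2*pi*e*sigma^2)). The sketch below shows that term only; how DE-VAE weights it against reconstruction and prior terms is not specified here.

```python
import math
import torch

def gaussian_differential_entropy(log_var):
    # log_var: (batch, latent_dim) log-variances predicted by the encoder.
    return 0.5 * (math.log(2 * math.pi * math.e) + log_var).sum(dim=-1)

log_var = torch.tensor([[0.0, 0.0], [2.0, 2.0]])
print(gaussian_differential_entropy(log_var))  # higher => more uncertain embedding
```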
[301] Counterfactual Probabilistic Diffusion with Expert Models
Wenhao Mu, Zhi Cao, Mehmed Uludag, Alexander Rodríguez
Main category: cs.LG
TL;DR: ODE-Diff: A time series diffusion framework that combines expert mechanistic models with data-driven approaches for more reliable counterfactual distribution prediction in dynamical systems.
Details
Motivation: Existing methods for predicting counterfactual distributions in complex systems rely on point estimates or purely data-driven models, which struggle with data scarcity and lack interpretability.
Method: Proposes ODE-Diff, a diffusion-based framework that incorporates guidance from imperfect expert models by extracting high-level signals as structured priors for generative modeling, bridging mechanistic and data-driven approaches.
Result: ODE-Diff consistently outperforms strong baselines in both point prediction and distributional accuracy across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies.
Conclusion: The method enables more reliable and interpretable causal inference by effectively combining expert knowledge with data-driven generative modeling in time series analysis.
Abstract: Predicting counterfactual distributions in complex dynamical systems is essential for scientific modeling and decision-making in domains such as public health and medicine. However, existing methods often rely on point estimates or purely data-driven models, which tend to falter under data scarcity. We propose a time series diffusion-based framework that incorporates guidance from imperfect expert models by extracting high-level signals to serve as structured priors for generative modeling. Our method, ODE-Diff, bridges mechanistic and data-driven approaches, enabling more reliable and interpretable causal inference. We evaluate ODE-Diff across semi-synthetic COVID-19 simulations, synthetic pharmacological dynamics, and real-world case studies, demonstrating that it consistently outperforms strong baselines in both point prediction and distributional accuracy.
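A schematic of expert-guided denoising under an assumed guidance form (the paper's extraction of high-level signals is more elaborate): each reverse step is nudged toward a coarse trajectory produced by a mechanistic model.

```python
import numpy as np

rng = np.random.default_rng(0)
expert_traj = np.linspace(100, 40, 50)   # e.g. case counts from an ODE model

def denoise_step(x, t, gamma=0.15):
    # Placeholder for the learned score/denoiser update.
    model_update = x + rng.normal(0, 1.0, x.shape)
    # Guidance: blend in the expert prior with strength gamma (assumed form).
    return model_update + gamma * (expert_traj - model_update)

x = rng.normal(70, 20, 50)               # start from noise
for t in reversed(range(30)):
    x = denoise_step(x, t)
print(float(np.abs(x - expert_traj).mean()))  # pulled toward the expert prior
```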
[302] DivMerge: A divergence-based model merging method for multi-tasking
Brahim Touayouch, Loïc Fosse, Géraldine Damnati, Gwénolé Lecorvé
Main category: cs.LG
TL;DR: A robust model merging method using Jensen-Shannon divergence to combine multiple task-specific models into one without additional labeled data, maintaining performance across all tasks even as task count increases.
Details
Motivation: To address task interference in multi-task learning when merging multiple fine-tuned models, which worsens with increasing task numbers, and to create a single model that maintains strong performance across all tasks without requiring additional labeled data.
Method: Leverages Jensen-Shannon divergence to guide the model merging process, automatically balances task importance during merging, and operates without needing additional labeled data.
Result: The approach remains robust as the number of tasks grows and consistently outperforms prior work in model merging via task arithmetic.
Conclusion: Proposed method effectively merges multiple task-specific models into a single robust model that maintains performance across all tasks, addressing task interference challenges in multi-task learning.
Abstract: Multi-task learning (MTL) is often achieved by merging datasets before fine-tuning, but the growing availability of fine-tuned models has led to new approaches such as model merging via task arithmetic. A major challenge in this setting is task interference, which worsens as the number of tasks increases. We propose a method that merges models trained on different tasks into a single model, maintaining strong performance across all tasks. Our approach leverages Jensen-Shannon divergence to guide the merging process without requiring additional labelled data, and automatically balances task importance. Unlike existing methods, our approach remains robust as the number of tasks grows and consistently outperforms prior work.
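A sketch of divergence-guided merging under assumed mechanics: score candidate interpolation weights by the Jensen-Shannon divergence between the merged model's predictions and each task model's predictions on unlabeled probe inputs, which requires no labels.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def merged_predict(alpha, x):            # toy two-task linear "models"
    w = alpha * W_TASK1 + (1 - alpha) * W_TASK2
    logits = x @ w
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
W_TASK1, W_TASK2 = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
x = rng.normal(size=(32, 8))             # unlabeled probe inputs

# Endpoint predictions stand in for the individual task models.
p1, p2 = merged_predict(1.0, x), merged_predict(0.0, x)
best = min(np.linspace(0, 1, 11),
           key=lambda a: (js_divergence(merged_predict(a, x), p1).mean()
                          + js_divergence(merged_predict(a, x), p2).mean()))
print("chosen merge weight:", best)
```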
[303] K2-Think: A Parameter-Efficient Reasoning System
Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
Main category: cs.LG
TL;DR: K2-Think is a 32B parameter reasoning system that matches or surpasses much larger models through advanced post-training and test-time computation techniques, achieving state-of-the-art performance in mathematical reasoning and strong results in other domains.
Details
Motivation: To demonstrate that smaller models can compete with state-of-the-art systems through integrated post-training techniques and inference optimizations, making high-performance reasoning systems more accessible and affordable.
Method: Built on the Qwen2.5 base model with six technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available datasets.
Result: Achieves state-of-the-art scores on public benchmarks for open-source models in mathematical reasoning, while performing strongly in Code and Science domains. Matches or surpasses performance of much larger models like GPT-OSS 120B and DeepSeek v3.1.
Conclusion: A 32B parameter model can compete with state-of-the-art systems through integrated post-training recipes and inference-time enhancements, making open-source reasoning systems more accessible with best-in-class inference speeds over 2,000 tokens per second.
Abstract: K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, while also performing strongly in other areas such as Code and Science. Our results confirm that a more parameter-efficient model like K2-Think 32B can compete with state-of-the-art systems through an integrated post-training recipe that includes long chain-of-thought training and strategic inference-time enhancements, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at k2think.ai, offering best-in-class inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.
[304] Adaptive Rainfall Forecasting from Multiple Geographical Models Using Matrix Profile and Ensemble Learning
Dung T. Tran, Huyen Ngoc Huyen, Hong Nguyen, Xuan-Vu Phan, Nam-Phong Nguyen
Main category: cs.LG
TL;DR: MPWE framework uses matrix profile analysis and redundancy-aware weighting to dynamically combine multiple geographical rainfall forecasts, achieving better accuracy and stability across Vietnamese river basins.
Details
Motivation: Rainfall forecasting in Vietnam is challenging due to diverse climatic conditions and geographical variability, but accurate forecasts are crucial for flood management, hydropower operation, and disaster preparedness.
Method: Matrix Profile-based Weighted Ensemble (MPWE) - a regime-switching framework that dynamically captures covariant dependencies among multiple geographical model forecasts with redundancy-aware weighting to balance contributions across models.
Result: MPWE consistently achieves lower mean and standard deviation of prediction errors compared to geographical models and ensemble baselines across eight major basins and five forecast horizons (1 hour to 84 hours).
Conclusion: The proposed MPWE framework demonstrates both improved accuracy and stability in rainfall forecasting across different Vietnamese river basins and time horizons.
Abstract: Rainfall forecasting in Vietnam is highly challenging due to its diverse climatic conditions and strong geographical variability across river basins, yet accurate and reliable forecasts are vital for flood management, hydropower operation, and disaster preparedness. In this work, we propose a Matrix Profile-based Weighted Ensemble (MPWE), a regime-switching framework that dynamically captures covariant dependencies among multiple geographical model forecasts while incorporating redundancy-aware weighting to balance contributions across models. We evaluate MPWE using rainfall forecasts from eight major basins in Vietnam, spanning five forecast horizons (1 hour and accumulated rainfall over 12, 24, 48, 72, and 84 hours). Experimental results show that MPWE consistently achieves lower mean and standard deviation of prediction errors compared to geographical models and ensemble baselines, demonstrating both improved accuracy and stability across basins and horizons.
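A sketch of matrix-profile-informed ensemble weighting; the `stumpy` library computes the profile, but the specific weighting rule below is an assumption, not the paper's MPWE formula.

```python
import numpy as np
import stumpy

rng = np.random.default_rng(1)
obs = rng.random(200)                          # observed rainfall series
forecasts = [obs + rng.normal(0, s, 200) for s in (0.05, 0.05, 0.3)]

weights = []
for f in forecasts:
    err = np.abs(f - obs)
    # Matrix profile of the error series: small distances indicate strongly
    # repeating (redundant) error regimes.
    mp = stumpy.stump(err, m=24)[:, 0].astype(float)
    # Assumed rule: favor accurate models, down-weight redundant error patterns.
    weights.append(mp.mean() / (err.mean() + 1e-9))
weights = np.array(weights) / np.sum(weights)

ensemble = sum(w * f for w, f in zip(weights, forecasts))
print("weights:", np.round(weights, 3))
```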
[305] MAESTRO: Multi-modal Adaptive Estimation for Temporal Respiratory Disease Outbreak
Hong Liu, Kerui Cen, Yanxing Chen, Zige Liu, Dong Chen, Zifeng Yang, Chitin Hon
Main category: cs.LG
TL;DR: MAESTRO is a unified multi-modal framework that integrates surveillance, web search, and weather data with spectro-temporal modeling for accurate influenza forecasting, achieving R-square of 0.956 on Hong Kong data.
Details
Motivation: Timely and robust influenza incidence forecasting is critical for public health decision-making to enable better outbreak response and resource allocation.
Method: Multi-modal data fusion integrating surveillance, web search trends, and meteorological data with adaptive weighting and spectro-temporal pattern decomposition.
Result: State-of-the-art performance with R-square of 0.956 on 11+ years of Hong Kong influenza data, with ablations confirming significant contributions of multi-modal and spectro-temporal components.
Conclusion: MAESTRO provides a powerful, modular, and reproducible tool for epidemiological forecasting that can be extended to other regions and pathogens, with publicly available pipeline.
Abstract: Timely and robust influenza incidence forecasting is critical for public health decision-making. This paper presents MAESTRO (Multi-modal Adaptive Estimation for Temporal Respiratory Disease Outbreak), a novel, unified framework that synergistically integrates advanced spectro-temporal modeling with multi-modal data fusion, including surveillance, web search trends, and meteorological data. By adaptively weighting heterogeneous data sources and decomposing complex time series patterns, the model achieves robust and accurate forecasts. Evaluated on over 11 years of Hong Kong influenza data (excluding the COVID-19 period), MAESTRO demonstrates state-of-the-art performance, achieving a superior model fit with an R-square of 0.956. Extensive ablations confirm the significant contributions of its multi-modal and spectro-temporal components. The modular and reproducible pipeline is made publicly available to facilitate deployment and extension to other regions and pathogens, presenting a powerful tool for epidemiological forecasting.
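A minimal sketch of adaptive multi-modal weighting with a learned softmax gate (an assumed form; MAESTRO's actual adaptive weighting and spectro-temporal decomposition are not specified here).

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=32, n_modalities=3):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)
        self.head = nn.Linear(dim, 1)          # weekly incidence forecast

    def forward(self, mods):                   # list of (batch, dim) tensors
        stacked = torch.stack(mods, dim=1)     # (batch, n_mod, dim)
        scores = self.gate(torch.cat(mods, dim=-1))
        alpha = torch.softmax(scores, dim=-1).unsqueeze(-1)
        fused = (alpha * stacked).sum(dim=1)   # adaptively weighted fusion
        return self.head(fused)

mods = [torch.randn(4, 32) for _ in range(3)]  # surveillance, search, weather
print(GatedFusion()(mods).shape)               # torch.Size([4, 1])
```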
[306] Quantum-Enhanced Forecasting for Deep Reinforcement Learning in Algorithmic Trading
Jun-Hao Chen, Yu-Chien Huang, Yun-Cheng Tsai, Samuel Yen-Chi Chen
Main category: cs.LG
TL;DR: Quantum-inspired neural networks combined with deep reinforcement learning create a trading agent for USD/TWD that achieves 11.87% return with minimal drawdown, outperforming currency ETFs.
Details
Motivation: To explore the convergence of quantum-inspired neural networks and deep reinforcement learning for improved financial trading performance, specifically in FX markets.
Method: Integrated Quantum Long Short-Term Memory (QLSTM) for short-term trend prediction with the Quantum Asynchronous Advantage Actor-Critic (QA3C) algorithm. Trained on 2000-2025 data (80% training, 20% testing) with specific state design, a reward function for trend-following and risk control, and multi-core training.
Result: The long-only agent achieved 11.87% return over approximately 5 years with only 0.92% maximum drawdown, outperforming several currency ETFs. Hybrid models demonstrated competitive FX trading performance.
Conclusion: QLSTM proves effective for small-profit trades with tight risk control. The hybrid quantum-classical approach shows promise for financial trading applications, though limitations include classical quantum simulation and simplified strategy.
Abstract: The convergence of quantum-inspired neural networks and deep reinforcement learning offers a promising avenue for financial trading. We implemented a trading agent for USD/TWD by integrating Quantum Long Short-Term Memory (QLSTM) for short-term trend prediction with Quantum Asynchronous Advantage Actor-Critic (QA3C), a quantum-enhanced variant of the classical A3C. Trained on data from 2000-01-01 to 2025-04-30 (80% training, 20% testing), the long-only agent achieves 11.87% return over around 5 years with 0.92% max drawdown, outperforming several currency ETFs. We detail state design (QLSTM features and indicators), reward function for trend-following/risk control, and multi-core training. Results show hybrid models yield competitive FX trading performance. Implications include QLSTM's effectiveness for small-profit trades with tight risk and future enhancements. Key hyperparameters: QLSTM sequence length = 4, QA3C workers = 8. Limitations: classical quantum simulation and simplified strategy. (Authors' disclaimer: The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.)
[307] Kriging prior Regression: A Case for Kriging-Based Spatial Features with TabPFN in Soil Mapping
Jonas Schmidinger, Viacheslav Barkov, Sebastian Vogel, Martin Atzmueller, Gerard B M Heuvelink
Main category: cs.LG
TL;DR: Kriging prior regression (KpR) combines machine learning with spatial context using spatial lag features from kriging, improving soil property prediction accuracy by 30% R2 compared to non-spatial ML methods.
Details
Motivation: To bridge the gap between machine learning (which captures feature relationships) and geostatistics (which leverages spatial structure) for more accurate digital soil mapping in precision agriculture.
Method: Proposed the KpR framework, which enriches ML with spatial context through engineering of ‘spatial lag’ features from ordinary kriging, using the TabPFN model on six field-scale datasets with soil properties and remote/proximal sensing features.
Result: KpR with TabPFN demonstrated reliable uncertainty estimates and significantly improved prediction accuracy (30% average R2 improvement) compared to both spatial techniques and non-spatial ML algorithms like random forest.
Conclusion: KpR with TabPFN is a robust and versatile modeling framework for digital soil mapping, particularly effective for small sample sizes common in precision agriculture and when proximal soil sensing data are limited.
Abstract: Machine learning and geostatistics are two fundamentally different frameworks for predicting and spatially mapping soil properties. Geostatistics leverages the spatial structure of soil properties, while machine learning captures the relationship between available environmental features and soil properties. We propose a hybrid framework that enriches ML with spatial context through engineering of ‘spatial lag’ features from ordinary kriging. We call this approach ‘kriging prior regression’ (KpR), as it follows the inverse logic of regression kriging. To evaluate this approach, we assessed both the point and probabilistic prediction performance of KpR, using the TabPFN model across six fieldscale datasets from LimeSoDa. These datasets included soil organic carbon, clay content, and pH, along with features derived from remote sensing and in-situ proximal soil sensing. KpR with TabPFN demonstrated reliable uncertainty estimates and more accurate predictions in comparison to several other spatial techniques (e.g., regression/residual kriging with TabPFN), as well as to established non-spatial machine learning algorithms (e.g., random forest). Most notably, it significantly improved the average R2 by around 30% compared to machine learning algorithms without spatial context. This improvement was due to the strong prediction performance of the TabPFN algorithm itself and the complementary spatial information provided by KpR features. TabPFN is particularly effective for prediction tasks with small sample sizes, common in precision agriculture, whereas KpR can compensate for weak relationships between sensing features and soil properties when proximal soil sensing data are limited. Hence, we conclude that KpR with TabPFN is a very robust and versatile modelling framework for digital soil mapping in precision agriculture.
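A compact sketch of the KpR recipe with `pykrige` and a random forest standing in for TabPFN (the toy data and feature naming are assumptions; in practice the spatial-lag feature at training locations should be computed with cross-validation to avoid kriging on a point's own observation).

```python
import numpy as np
from pykrige.ok import OrdinaryKriging
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, 80), rng.uniform(0, 100, 80)
soc = 2.0 + 0.02 * x - 0.01 * y + rng.normal(0, 0.1, 80)  # toy soil property
sensing = rng.normal(size=(80, 3))                        # sensing features

# Ordinary kriging provides the 'spatial lag' feature at each location.
ok = OrdinaryKriging(x, y, soc, variogram_model="spherical")
lag, _ = ok.execute("points", x, y)

# Append the kriged estimate to the sensing features and fit the learner.
X = np.column_stack([sensing, np.asarray(lag)])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, soc)
print(model.score(X, soc))
```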
[308] Balancing Utility and Privacy: Dynamically Private SGD with Random Projection
Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar
Main category: cs.LG
TL;DR: D2P2-SGD combines dynamic differential privacy with automatic gradient clipping and random projection to improve privacy-utility tradeoff in SGD optimization while maintaining efficiency.
Details
Motivation: Existing DPSGD has static noise that impacts model performance, and increasing model parameters make efficient learning challenging while addressing privacy leakage concerns.
Method: Combines dynamic differential privacy with automatic gradient clipping and random projection with SGD to dynamically adjust the privacy-utility tradeoff.
Result: Exhibits provably sub-linear convergence rates across different objective functions and shows remarkable accuracy enhancement while maintaining privacy in diverse datasets.
Conclusion: D2P2-SGD provides better utility at privacy cost through DDP and enables more efficient model learning through random projection, achieving state-of-the-art performance.
Abstract: Stochastic optimization is a pivotal enabler in modern machine learning, producing effective models for various tasks. However, several existing works have shown that model parameters and gradient information are susceptible to privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It exhibits provably sub-linear convergence rates across different objective functions, matching the best available rate. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD remarkably enhances accuracy while maintaining privacy. Our code is available here.
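A sketch of what one update combining the two ideas could look like, with assumed schedules and shapes (not the paper's exact optimizer): clip the gradient, add noise whose scale decays over iterations, and apply the update through a random low-dimensional projection.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 50                            # parameter dim, projection dim
theta = np.zeros(d)

def private_projected_step(theta, grad, t, lr=0.1, clip=1.0, sigma0=1.0):
    gnorm = np.linalg.norm(grad)
    if gnorm > clip:                       # automatic gradient clipping
        grad = grad * (clip / gnorm)
    sigma_t = sigma0 / np.sqrt(t + 1)      # dynamically decaying noise scale
    noisy = grad + rng.normal(0, sigma_t * clip, d)
    P = rng.normal(0, 1 / np.sqrt(k), (k, d))    # fresh random projection
    return theta - lr * (P.T @ (P @ noisy))      # update in projected subspace

for t in range(10):
    grad = theta - 1.0                     # gradient of 0.5 * ||theta - 1||^2
    theta = private_projected_step(theta, grad, t)
print(float(np.mean(theta)))
```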
cs.MA
[309] Tackling One Health Risks: How Large Language Models are leveraged for Risk Negotiation and Consensus-building
Alexandra Fetsch, Iurii Savvateev, Racem Ben Romdhane, Martin Wiedmann, Artemiy Dimov, Maciej Durkalec, Josef Teichmann, Jakob Zinsstag, Konstantinos Koutsoumanis, Andreja Rajkovic, Jason Mann, Mauro Tonolla, Monika Ehling-Schulz, Matthias Filter, Sophia Johler
Main category: cs.MA
TL;DR: AI-assisted negotiation framework using LLMs and autonomous agents to address complex cross-sectoral challenges through simulated negotiations, systematic modeling, and impact evaluation.
Details
Motivation: Conventional risk analysis frameworks create silos that hinder comprehensive solutions for complex global challenges, while time constraints and information overload prevent effective stakeholder negotiations.
Method: Developed an AI-assisted negotiation framework incorporating large language models and AI-based autonomous agents into a negotiation-centered risk analysis workflow, with proof-of-concept implementations in biopesticide use and wild animal population control scenarios.
Result: The framework successfully mitigated information overload and augmented decision-making under time constraints, demonstrating potential for cross-sectoral engagement through open-source, web-based design accessible to users with limited resources.
Conclusion: AI-assisted negotiation shows promise for addressing the lack of tools for holistic, cross-sectoral problem-solving, enabling stakeholders to simulate negotiations, anticipate compromises, and evaluate solution impacts in complex scenarios.
Abstract: Key global challenges of our times are characterized by complex interdependencies and can only be effectively addressed through an integrated, participatory effort. Conventional risk analysis frameworks often reduce complexity to ensure manageability, creating silos that hinder comprehensive solutions. A fundamental shift towards holistic strategies is essential to enable effective negotiations between different sectors and to balance the competing interests of stakeholders. However, achieving this balance is often hindered by limited time, vast amounts of information, and the complexity of integrating diverse perspectives. This study presents an AI-assisted negotiation framework that incorporates large language models (LLMs) and AI-based autonomous agents into a negotiation-centered risk analysis workflow. The framework enables stakeholders to simulate negotiations, systematically model dynamics, anticipate compromises, and evaluate solution impacts. By leveraging LLMs’ semantic analysis capabilities, we could mitigate information overload and augment the decision-making process under time constraints. Proof-of-concept implementations were conducted in two real-world scenarios: (i) prudent use of a biopesticide, and (ii) targeted wild animal population control. Our work demonstrates the potential of AI-assisted negotiation to address the current lack of tools for cross-sectoral engagement. Importantly, the solution’s open-source, web-based design suits application by a broader audience with limited resources and enables users to tailor and develop it for their own needs.
[310] A Holistic Architecture for Monitoring and Optimization of Robust Multi-Agent Path Finding Plan Execution
David Zahrádka, Denisa Mužíková, David Woller, Miroslav Kulich, Jiří Švancara, Roman Barták
Main category: cs.MA
TL;DR: A holistic architecture for robust execution of Multi-Agent Path Finding plans that monitors delays and decides when to search for alternative plans to optimize execution duration.
Details
Motivation: Robots executing MAPF plans may get delayed, introducing collision risks and impacting execution duration. Continuing with original optimal plans may become suboptimal due to accumulated delays, but searching for alternatives is costly.
Method: Uses Action Dependency Graph for robust execution to estimate expected execution duration. Predicts potential benefits of finding alternative plans. Evaluated in real-time simulator mimicking autonomous warehouse robotic fleet.
Result: The architecture enables monitoring of plan execution and intelligent decision-making about when to search for alternative plans to optimize overall execution time.
Conclusion: Proposed holistic approach effectively handles delays in MAPF execution by combining robust execution methods with intelligent monitoring and optimization decisions.
Abstract: The goal of Multi-Agent Path Finding (MAPF) is to find a set of paths for a fleet of agents moving in a shared environment such that the agents reach their goals without colliding with each other. In practice, some of the robots executing the plan may get delayed, which can introduce collision risk. Although robust execution methods are used to ensure safety even in the presence of delays, the delays may still have a significant impact on the duration of the execution. At some point, the accumulated delays may become significant enough that instead of continuing with the execution of the original plan, even if it was optimal, there may now exist an alternate plan which will lead to a shorter execution. However, the problem is how to decide when to search for the alternate plan, since it is a costly procedure. In this paper, we propose a holistic architecture for robust execution of MAPF plans, its monitoring and optimization. We exploit a robust execution method called Action Dependency Graph to maintain an estimate of the expected execution duration during the plan’s execution. This estimate is used to predict the potential that finding an alternate plan would lead to shorter execution. We empirically evaluate the architecture in experiments in a real-time simulator which we designed to mimic our real-life demonstrator of an autonomous warehouse robotic fleet.
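A toy sketch of the monitoring idea: if an Action Dependency Graph is viewed as a DAG whose edge weights are (possibly delayed) action-duration estimates, the expected makespan is a longest-path query, and replanning is triggered only when the predicted saving exceeds the cost of the search. The networkx usage is standard; the decision rule and numbers are illustrative assumptions, not the paper’s policy.

```python
import networkx as nx

# Action Dependency Graph as a DAG: nodes are action events, edge weights
# are the current duration estimates (updated as delays are observed).
adg = nx.DiGraph()
adg.add_weighted_edges_from(
    [("start", "a1", 2.0), ("a1", "a2", 3.5), ("start", "b1", 1.0),
     ("b1", "a2", 4.0), ("a2", "goal", 2.0)],
    weight="duration",
)

def expected_makespan(g):
    return nx.dag_longest_path_length(g, weight="duration")

def should_replan(current_est, alt_est, replan_cost):
    # search for an alternate plan only when the predicted saving pays for it
    return current_est - alt_est > replan_cost

est = expected_makespan(adg)
print(est, should_replan(est, alt_est=7.0, replan_cost=1.5))
```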
cs.MM
[311] Out-Of-Distribution Detection for Audio-visual Generalized Zero-Shot Learning: A General Framework
Liuyuan Wen
Main category: cs.MM
TL;DR: A unified framework combining generative and embedding methods for audio-visual generalized zero-shot learning using OOD detection to classify both seen and unseen classes.
Details
Motivation: Existing GZSL methods face challenges: generative training is unstable while embedding methods suffer from domain shift. Integrating both approaches can leverage their strengths while mitigating weaknesses.
Method: Uses GANs to synthesize unseen features, trains OOD detector to identify seen/unseen classes, then employs separate classifiers for each feature type based on the detection result.
Result: Significant improvement over state-of-the-art methods on three popular audio-visual datasets.
Conclusion: The proposed OOD detection framework effectively combines generative and embedding approaches, achieving superior performance in audio-visual GZSL by addressing the limitations of individual methods.
Abstract: Generalized Zero-Shot Learning (GZSL) is a challenging task requiring accurate classification of both seen and unseen classes. Within this domain, Audio-visual GZSL emerges as an extremely exciting yet difficult task, given the inclusion of both visual and acoustic features as multi-modal inputs. Existing efforts in this field mostly utilize either embedding-based or generative-based methods. However, generative training is difficult and unstable, while embedding-based methods often encounter domain shift problem. Thus, we find it promising to integrate both methods into a unified framework to leverage their advantages while mitigating their respective disadvantages. Our study introduces a general framework employing out-of-distribution (OOD) detection, aiming to harness the strengths of both approaches. We first employ generative adversarial networks to synthesize unseen features, enabling the training of an OOD detector alongside classifiers for seen and unseen classes. This detector determines whether a test feature belongs to seen or unseen classes, followed by classification utilizing separate classifiers for each feature type. We test our framework on three popular audio-visual datasets and observe a significant improvement compared to existing state-of-the-art works. Code can be found at https://github.com/liuyuan-wen/AV-OOD-GZSL.
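A compact sketch of the inference-time routing the framework describes: an OOD detector decides whether a fused audio-visual feature looks “seen”, and the corresponding classifier is then applied. The detector, classifiers, and threshold below are placeholders standing in for the trained components, not the paper’s models.

```python
import numpy as np

def route_and_classify(feature, ood_score, seen_clf, unseen_clf, tau=0.5):
    """ood_score(feature) high => in-distribution (seen class)."""
    if ood_score(feature) >= tau:
        return "seen", seen_clf(feature)
    return "unseen", unseen_clf(feature)

# placeholder components for illustration
rng = np.random.default_rng(0)
W_seen, W_unseen = rng.normal(size=(10, 64)), rng.normal(size=(5, 64))
domain, label = route_and_classify(
    rng.normal(size=64),
    ood_score=lambda f: 1.0 / (1.0 + np.exp(-f.mean())),
    seen_clf=lambda f: int(np.argmax(W_seen @ f)),
    unseen_clf=lambda f: int(np.argmax(W_unseen @ f)),
)
```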
eess.AS
[312] Spectral Bottleneck in Deep Neural Networks: Noise is All You Need
Hemanth Chandravamsi, Dhanush V. Shenoy, Itay Zinn, Shimon Pisnoy, Steven H. Frankel
Main category: eess.AS
TL;DR: WINNER proposes a weight perturbation scheme with adaptive Gaussian noise scales based on target signal’s spectral centroid to overcome spectral bottleneck in neural networks when fitting high-frequency-dominant signals.
Details
Motivation: Deep neural networks suffer from spectral learning bias where low frequencies are learned first, creating a spectral bottleneck when target signals lack low-frequency components and are dominated by high frequencies, particularly problematic in implicit neural representations.
Method: Proposes WINNER - weight initialization with noise for neural representations, which perturbs uniformly initialized weights with Gaussian noise where noise scales are adaptively determined by the spectral centroid of the target signal.
Result: The method addresses spectral bottleneck, yields faster convergence, improves representation accuracy, outperforms state-of-the-art in audio fitting, and achieves notable gains in image fitting and denoising tasks.
Conclusion: WINNER provides an effective solution for fitting any target signal regardless of frequency content and opens new directions for adaptive weight initialization strategies in computer vision and scientific machine learning.
Abstract: Deep neural networks are known to exhibit a spectral learning bias, wherein low-frequency components are learned early in training, while high-frequency modes emerge more gradually in later epochs. However, when the target signal lacks low-frequency components and is dominated by broadband high frequencies, training suffers from a ‘spectral bottleneck’, and the model fails to reconstruct the entire signal, including the frequency components that lie within the network’s representational capacity. We examine such a scenario in the context of implicit neural representations (INRs) with sinusoidal representation networks (SIRENs), focusing on the challenge of fitting high-frequency-dominant signals that are susceptible to spectral bottleneck. To effectively fit any target signal irrespective of its frequency content, we propose a generalized target-aware ‘weight perturbation scheme’ (WINNER - weight initialization with noise for neural representations) for network initialization. The scheme perturbs uniformly initialized weights with Gaussian noise, where the noise scales are adaptively determined by the spectral centroid of the target signal. We show that the noise scales can provide control over the spectra of network activations and the eigenbasis of the empirical neural tangent kernel. This method not only addresses the spectral bottleneck but also yields faster convergence and improved representation accuracy, outperforming state-of-the-art approaches in audio fitting and achieving notable gains in image fitting and denoising tasks. Beyond signal reconstruction, our approach opens new directions for adaptive weight initialization strategies in computer vision and scientific machine learning.
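A minimal NumPy sketch of the WINNER idea as summarized: uniformly initialized weights are perturbed with Gaussian noise whose scale is tied to the target signal’s spectral centroid. The exact centroid-to-scale mapping is the paper’s contribution and is not specified here; the linear `alpha * centroid / nyquist` rule below is purely an illustrative assumption.

```python
import numpy as np

def spectral_centroid(signal, fs):
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return (freqs * spec).sum() / (spec.sum() + 1e-12)

def winner_init(shape, signal, fs, alpha=0.5, rng=np.random.default_rng(0)):
    fan_in = shape[1]
    bound = np.sqrt(6.0 / fan_in)                 # standard uniform init bound
    w = rng.uniform(-bound, bound, size=shape)
    # noise scale adapted to the target's spectral centroid (assumed rule)
    sigma = alpha * spectral_centroid(signal, fs) / (fs / 2.0)
    return w + rng.normal(0.0, sigma * bound, size=shape)

fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 6000 * t)             # high-frequency-dominant target
W = winner_init((256, 256), target, fs)
```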
[313] The MSP-Podcast Corpus
Carlos Busso, Reza Lotfian, Kusha Sridhar, Ali N. Salman, Wei-Cheng Lin, Lucas Goncalves, Srinivas Parthasarathy, Abinay Reddy Naini, Seong-Gyun Leem, Luz Martinez-Lucas, Huang-Cheng Chou, Pravin Mote
Main category: eess.AS
TL;DR: The MSP-Podcast corpus is a 400+ hour emotional speech database collected from audio-sharing websites, featuring diverse emotional annotations and speaker identification to advance speech emotion recognition in real-world applications.
Details
Motivation: Existing emotional speech databases face limitations in size, emotional balance, and speaker diversity, hindering progress in speech emotion recognition for real-world scenarios.
Method: Collected diverse audio samples from audio-sharing websites with Common Licenses, annotated with primary/secondary emotional categories and valence/arousal/dominance attributes by at least five raters, using ML-driven pipeline for emotional diversity selection.
Result: Created a comprehensive 400+ hour corpus with speaker identification, human transcriptions, and rich emotional annotations, ensuring balanced emotional representation across speakers and environments.
Conclusion: The MSP-Podcast corpus provides a high-quality, diverse resource that better supports the development of speech emotion recognition systems for practical real-world applications.
Abstract: The availability of large, high-quality emotional speech databases is essential for advancing speech emotion recognition (SER) in real-world scenarios. However, many existing databases face limitations in size, emotional balance, and speaker diversity. This study describes the MSP-Podcast corpus, summarizing our ten-year effort. The corpus consists of over 400 hours of diverse audio samples from various audio-sharing websites, all of which have Common Licenses that permit the distribution of the corpus. We annotate the corpus with rich emotional labels, including primary (single dominant emotion) and secondary (multiple emotions perceived in the audio) emotional categories, as well as emotional attributes for valence, arousal, and dominance. At least five raters annotate these emotional labels. The corpus also has speaker identification for most samples, and human transcriptions of the lexical content of the sentences for the entire corpus. The data collection protocol includes a machine learning-driven pipeline for selecting emotionally diverse recordings, ensuring a balanced and varied representation of emotions across speakers and environments. The resulting database provides a comprehensive, high-quality resource, better suited for advancing SER systems in practical, real-world scenarios.
[314] Acoustic Scene Classification Using CNN-GRU Model Without Knowledge Distillation
Ee-Leng Tan, Jun Wei Yeow, Santi Peksi, Haowen Li, Ziyi Yang, Woon-Seng Gan
Main category: eess.AS
TL;DR: A lightweight CNN-GRU model for acoustic scene classification that achieves 60.25% accuracy with only 114.2KB memory and 10.9M MAC operations, trained solely on TAU Urban Acoustic Scene 2022 dataset with DIR augmentation.
Details
Motivation: To develop a low-complexity acoustic scene classification model for the DCASE 2025 challenge that departs from traditional knowledge distillation approaches and achieves high performance with minimal computational resources.
Method: Proposed a CNN-GRU model trained exclusively on TAU Urban Acoustic Scene 2022 Mobile development dataset, using MicIRP for device impulse response (DIR) augmentation without external datasets.
Result: The model achieved 60.25% accuracy on the development dataset with extremely low resource requirements: 114.2KB memory usage and 10.9M multiply-and-accumulate operations.
Conclusion: The approach demonstrates that effective low-complexity acoustic scene classification can be achieved without knowledge distillation, using a carefully designed CNN-GRU architecture with appropriate data augmentation techniques.
Abstract: In this technical report, we present the SNTL-NTU team’s Task 1 submission for the Low-Complexity Acoustic Scenes and Events (DCASE) 2025 challenge. This submission departs from the typical application of knowledge distillation from a teacher to a student model, aiming to achieve high performance with limited complexity. The proposed model is based on a CNN-GRU model and is trained solely using the TAU Urban Acoustic Scene 2022 Mobile development dataset, without utilizing any external datasets, except for MicIRP, which is used for device impulse response (DIR) augmentation. The proposed model has a memory usage of 114.2KB and requires 10.9M multiply-and-accumulate (MAC) operations. Using the development dataset, the proposed model achieved an accuracy of 60.25%.
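A skeletal PyTorch sketch of a CNN-GRU classifier of the kind described: a convolutional front-end over a mel-spectrogram, a GRU over time, and a linear head. Layer sizes are arbitrary placeholders and are not tuned to the reported 114.2KB / 10.9M-MAC budget.

```python
import torch
import torch.nn as nn

class CnnGru(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(                       # input: (B, 1, mels, T)
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                               # x: (B, 1, mels, T)
        z = self.cnn(x)                                 # (B, C, mels/4, T/4)
        z = z.permute(0, 3, 1, 2).flatten(2)            # (B, T/4, C*mels/4)
        _, h = self.gru(z)                              # h: (1, B, hidden)
        return self.head(h[-1])

logits = CnnGru()(torch.randn(2, 1, 64, 100))           # -> (2, 10)
```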
[315] Effective Modeling of Critical Contextual Information for TDNN-based Speaker Verification
Shilong Weng, Liu Yang, Ji Mao
Main category: eess.AS
TL;DR: Three improved ECAPA-TDNN architectures that better capture multi-scale contextual features for speaker verification, achieving 23% lower EER than original ECAPA-TDNN on VoxCeleb1-O.
Details
Motivation: ECAPA-TDNN's hierarchical convolutional structure in SE-Res2Block cannot fully utilize contextual information, resulting in weak ability to model effective context dependencies for speaker verification.
Method: Proposed three improved architectures based on ECAPA-TDNN to fully extract multi-scale features with context dependence and aggregate these features effectively.
Result: Experimental results on VoxCeleb and CN-Celeb show effectiveness. One architecture achieves nearly 23% lower Equal Error Rate compared to ECAPA-TDNN on VoxCeleb1-O dataset.
Conclusion: The proposed architectures demonstrate competitive performance among current TDNN architectures with comparable parameter count, effectively addressing the contextual modeling limitations of ECAPA-TDNN.
Abstract: Today, Time Delay Neural Network (TDNN) has become the mainstream architecture for the speaker verification task, in which the ECAPA-TDNN is one of the state-of-the-art models. The current works that focus on improving TDNN primarily address the limitations of TDNN in modeling global information and bridge the gap between TDNN and 2-Dimensional convolutions. However, the hierarchical convolutional structure in the SE-Res2Block proposed by ECAPA-TDNN cannot make full use of the contextual information, resulting in the weak ability of ECAPA-TDNN to model effective context dependencies. To this end, three improved architectures based on ECAPA-TDNN are proposed to fully and effectively extract multi-scale features with context dependence and then aggregate these features. The experimental results on VoxCeleb and CN-Celeb verify the effectiveness of the three proposed architectures. One of these architectures achieves a nearly 23% lower Equal Error Rate compared to ECAPA-TDNN on the VoxCeleb1-O dataset, demonstrating competitive performance among current TDNN architectures at a comparable parameter count.
[316] Whisper Has an Internal Word Aligner
Sung-Lin Yeh, Yen Meng, Hao Tang
Main category: eess.AS
TL;DR: Unsupervised method using Whisper’s attention heads with character inputs produces more accurate word alignments than previous approaches without requiring training.
Details
Motivation: There's growing need for precise word-level timestamps from ASR systems like Whisper, but existing methods either need additional training or aren't competitive, with loose evaluation standards (200ms+ tolerance).
Method: Analyze Whisper's attention heads to identify those capturing accurate word alignments, use character inputs instead of wordpieces for finer alignment, and propose unsupervised filtering approach during teacher forcing.
Result: The approach produces word alignments more accurate than prior work under stricter tolerance levels (20-100ms) without requiring any training.
Conclusion: Specific attention heads in Whisper capture precise word alignments, and using character inputs with proper head filtering enables superior unsupervised word alignment extraction.
Abstract: There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
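A generic NumPy sketch of the head-filtering idea as described: given cross-attention maps collected while teacher-forcing the character sequence, score each head by how monotone its alignment is, keep the best heads, and read timestamps off the averaged map. How Whisper’s attentions are extracted and the paper’s actual selection criterion are not reproduced here; the monotonicity score is an illustrative stand-in.

```python
import numpy as np

def monotonicity(attn):                 # attn: (n_chars, n_frames)
    peaks = attn.argmax(axis=1)
    return (np.diff(peaks) >= 0).mean() # fraction of non-decreasing peak moves

def align(attn_heads, keep=4):          # attn_heads: (n_heads, n_chars, n_frames)
    scores = np.array([monotonicity(a) for a in attn_heads])
    best = attn_heads[np.argsort(scores)[-keep:]].mean(axis=0)
    return best.argmax(axis=1)          # frame index per character

rng = np.random.default_rng(0)
attn = rng.random((8, 20, 150))         # stand-in for collected cross-attentions
char_frames = align(attn)               # map to seconds via the 20 ms frame hop
```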
[317] Unified Learnable 2D Convolutional Feature Extraction for ASR
Peter Vieting, Benedikt Hilmes, Ralf Schlüter, Hermann Ney
Main category: eess.AS
TL;DR: A generic 2D convolutional neural front-end for ASR that reduces reliance on classical methods, achieves parameter efficiency, and matches performance of existing supervised feature extractors.
Details
Motivation: To develop a more generic and unified front-end architecture for speech feature extraction that reduces dependence on classical methods and avoids complex layer compositions from different sources.
Method: A 2D convolutional neural network front-end that is parameter-efficient and designed for limited computational resources, systematically reducing influence of existing techniques.
Result: The generic unified approach is feasible and matches the performance of existing supervised learnable feature extractors while being computationally efficient.
Conclusion: A simple 2D convolutional front-end can serve as an effective generic feature extractor for ASR tasks without the complexity of large pre-trained models or classical method dependencies.
Abstract: Neural front-ends represent a promising approach to feature extraction for automatic speech recognition (ASR) systems as they enable learning features specifically tailored to different tasks. Yet, many of the existing techniques remain heavily influenced by classical methods. While this inductive bias may ease the system design, our work aims to develop a more generic front-end for feature extraction. Furthermore, we seek to unify the front-end architecture contrasting with existing approaches that apply a composition of several layer topologies originating from different sources. The experiments systematically show how to reduce the influence of existing techniques to achieve a generic front-end. The resulting 2D convolutional front-end is parameter-efficient and suitable for a scenario with limited computational resources unlike large models pre-trained on unlabeled audio. The results demonstrate that this generic unified approach is not only feasible but also matches the performance of existing supervised learnable feature extractors.
[318] Towards Data Drift Monitoring for Speech Deepfake Detection in the context of MLOps
Xin Wang, Wanying Ge, Junichi Yamagishi
Main category: eess.AS
TL;DR: This paper addresses the vulnerability of static speech deepfake detectors to new attacks by proposing MLOps-based monitoring of data drift and fine-tuning strategies to maintain detection performance.
Details
Motivation: Static speech deepfake detectors deployed in cloud services become vulnerable to newly created speech deepfake attacks over time, requiring continuous monitoring and adaptation.
Method: The authors monitor data drift using distribution distances between new and reference data, and fine-tune detectors using similarly drifted data from new text-to-speech (TTS) attacks.
Result: Experiments on toy dataset and large-scale MLAAD dataset show that drift from new TTS attacks can be effectively monitored, and fine-tuning reduces both drift and detection error rates.
Conclusion: Continuous monitoring of data drift and adaptive fine-tuning using new attack data can effectively maintain speech deepfake detection performance against evolving threats.
Abstract: When being delivered in applications or services on the cloud, static speech deepfake detectors that are not updated will become vulnerable to newly created speech deepfake attacks. From the perspective of machine learning operations (MLOps), this paper tries to answer whether we can monitor new and unseen speech deepfake data that drifts away from a seen reference data set. We further ask, if drift is detected, whether we can fine-tune the detector using similarly drifted data, reduce the drift, and improve the detection performance. On a toy dataset and the large-scale MLAAD dataset, we show that the drift caused by new text-to-speech (TTS) attacks can be monitored using distances between the distributions of the new data and reference data. Furthermore, we demonstrate that fine-tuning the detector using data generated by the new TTS deepfakes can reduce the drift and the detection error rates.
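A small sketch of one way to implement the distance-based monitoring described above, assuming detector embeddings are available for a reference set and a new batch: fit a Gaussian to each and compute the Fréchet distance, flagging drift above a calibrated threshold. The specific distance and threshold are assumptions; the paper measures distances between new and reference data distributions without this sketch committing to its exact choice.

```python
import numpy as np
from scipy import linalg

def frechet_distance(X_ref, X_new):
    mu1, mu2 = X_ref.mean(0), X_new.mean(0)
    s1 = np.cov(X_ref, rowvar=False)
    s2 = np.cov(X_new, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2).real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, (500, 32))          # embeddings of seen reference data
new = rng.normal(0.5, 1, (200, 32))        # embeddings from a new TTS system
drifted = frechet_distance(ref, new) > 5.0 # threshold calibrated on held-out seen data
```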
[319] Error Analysis in a Modular Meeting Transcription System
Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
Main category: eess.AS
TL;DR: Analysis of leakage in speech separation for meeting transcription, showing cross-channel leakage occurs but doesn’t significantly impact performance due to VAD filtering. Advanced diarization reduces gap to oracle segmentation by one third compared to simple VAD.
Details
Motivation: Meeting transcription has seen significant progress but still faces challenges. This work aims to analyze leakage issues in speech separation systems and understand how they affect transcription performance.
Method: Extended a previously proposed framework for analyzing leakage in speech separation with improved temporal locality sensitivity. Compared different segmentation approaches including energy-based VAD and advanced diarization methods against oracle segmentation.
Result: Significant cross-channel leakage occurs in areas where only the primary speaker is active, but this doesn’t affect final performance much as leaked parts are largely ignored by VAD. Advanced diarization reduces the gap to oracle segmentation by one third compared to simple energy-based VAD.
Conclusion: The study provides insights into leakage mechanisms in speech separation and demonstrates state-of-the-art performance on LibriCSS among systems trained only on LibriSpeech data, while identifying factors contributing to remaining performance gaps.
Abstract: Meeting transcription is a field of high relevance and remarkable progress in recent years. Still, challenges remain that limit its performance. In this work, we extend a previously proposed framework for analyzing leakage in speech separation with proper sensitivity to temporal locality. We show that there is significant leakage to the cross channel in areas where only the primary speaker is active. At the same time, the results demonstrate that this does not affect the final performance much as these leaked parts are largely ignored by the voice activity detection (VAD). Furthermore, different segmentations are compared showing that advanced diarization approaches are able to reduce the gap to oracle segmentation by a third compared to a simple energy-based VAD. We additionally reveal what factors contribute to the remaining difference. The results represent state-of-the-art performance on LibriCSS among systems that train the recognition module on LibriSpeech data only.
[320] Low-latency Assistive Audio Enhancement for Neurodivergent People
Alexander Popescu, Rosie Frost, Milos Cernak
Main category: eess.AS
TL;DR: Researchers developed audio enhancement algorithms to filter distressing sounds for neurodivergent individuals, using trigger sounds identified from online communities and achieving best results with Dynamic Range Compression.
Details
Motivation: 50-70% of neurodivergent people experience sound sensitivity that causes discomfort to severe distress, creating a critical need for assistive audio technologies.
Method: Curated trigger sounds from neurodivergent communities on Reddit, compiled dataset from FSD50K and ESC50, then trained and evaluated various DSP and ML audio enhancement algorithms.
Result: Dynamic Range Compression (DRC) was the most effective approach, successfully attenuating trigger sounds and reducing auditory distress for neurodivergent listeners.
Conclusion: DRC-based audio enhancement shows promise as an effective assistive technology for managing sound sensitivity in neurodivergent populations.
Abstract: Neurodivergent people frequently experience decreased sound tolerance, with estimates suggesting it affects 50-70% of this population. This heightened sensitivity can provoke reactions ranging from mild discomfort to severe distress, highlighting the critical need for assistive audio enhancement technologies. In this paper, we propose several assistive audio enhancement algorithms designed to selectively filter distressing sounds. To this end, we curated a list of potential trigger sounds by analyzing neurodivergent-focused communities on platforms such as Reddit. Using this list, a dataset of trigger sound samples was compiled from publicly available sources, including FSD50K and ESC50. These samples were then used to train and evaluate various Digital Signal Processing (DSP) and Machine Learning (ML) audio enhancement algorithms. Among the approaches explored, Dynamic Range Compression (DRC) proved the most effective, successfully attenuating trigger sounds and reducing auditory distress for neurodivergent listeners.
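A bare-bones NumPy sketch of the dynamic range compression found most effective above: an RMS envelope follower and a downward gain applied above a threshold. The threshold, ratio, and smoothing constant are illustrative; the paper’s tuned parameters are not given here.

```python
import numpy as np

def compress(x, fs, thresh_db=-30.0, ratio=4.0, attack_ms=5.0):
    alpha = np.exp(-1.0 / (fs * attack_ms / 1000.0))  # one-pole envelope smoothing
    env, gains = 0.0, np.empty_like(x)
    for i, s in enumerate(x):
        env = alpha * env + (1 - alpha) * s * s
        level_db = 10.0 * np.log10(env + 1e-12)
        over = max(0.0, level_db - thresh_db)          # dB above threshold
        gains[i] = 10.0 ** (-over * (1 - 1 / ratio) / 20.0)
    return x * gains

fs = 16000
t = np.arange(fs) / fs
burst = np.sin(2 * np.pi * 3000 * t) * (t > 0.5)   # sudden loud trigger-like sound
y = compress(burst, fs)                             # loud section attenuated
```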
[321] Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Weiqiao Shan, Yuang Li, Yuhao Zhang, Yingfeng Luo, Chen Xu, Xiaofeng Zhao, Long Meng, Yunfei Lu, Min Zhang, Hao Yang, Tong Xiao, Jingbo Zhu
Main category: eess.AS
TL;DR: PaM (Prompt-aware Mixture) enhances Speech LLMs by using multiple audio encoders with task-specific experts that extract different features based on prompt instructions, achieving superior performance across multiple audio understanding tasks.
Details
Motivation: Different audio understanding tasks require distinct features - some emphasize semantic aspects while others focus on acoustic aspects. Single unified audio features may not be optimal for all tasks, making task-specific audio features more desirable.
Method: Proposes Prompt-aware Mixture (PaM) that uses multiple audio encoders with different experts to extract task-specific features based on the prompt indicating different tasks. This allows the Speech LLM to generate appropriate features for each specific audio understanding task.
Result: With PaM, a single Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, Speaker Number Verification, and Audio Captioning tasks. PaM also outperforms other feature fusion baselines like concatenation and averaging.
Conclusion: The proposed PaM approach effectively addresses the need for task-specific audio features in Speech LLMs, demonstrating superior performance across multiple audio understanding tasks compared to single-encoder approaches and other fusion methods.
Abstract: Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, Speaker Number Verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging. Our code will be available at: https://github.com/shanweiqiao/PaM
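A schematic PyTorch sketch of the prompt-aware mixing idea: a prompt embedding produces softmax gates that weight features from several audio encoders before they reach the LLM. Encoder outputs and the prompt embedding are placeholders, and the plain soft mixture below is an assumed simplification of PaM’s expert routing, not the paper’s code.

```python
import torch
import torch.nn as nn

class PromptAwareMixture(nn.Module):
    def __init__(self, n_encoders=3, prompt_dim=128):
        super().__init__()
        self.gate = nn.Linear(prompt_dim, n_encoders)

    def forward(self, encoder_feats, prompt_emb):
        # encoder_feats: (B, n_encoders, T, feat_dim); prompt_emb: (B, prompt_dim)
        w = torch.softmax(self.gate(prompt_emb), dim=-1)         # (B, n_encoders)
        return (w[:, :, None, None] * encoder_feats).sum(dim=1)  # (B, T, feat_dim)

feats = torch.randn(2, 3, 50, 256)   # three encoders' outputs for one utterance
prompt = torch.randn(2, 128)         # embedding of the task prompt (e.g. "transcribe")
fused = PromptAwareMixture()(feats, prompt)  # task-specific feature for the LLM
```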
[322] Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency
Bunlong Lay, Rostislav Makarov, Timo Gerkmann
Main category: eess.AS
TL;DR: A sliding window diffusion framework for real-time speech enhancement that trades off performance and latency by progressively corrupting speech signals through time with more noise to recent frames.
Details
Motivation: Diffusion models show remarkable success in speech enhancement but are computationally expensive and impractical for real-time streaming data processing due to high inference time.
Method: Adapt a sliding window diffusion framework that progressively corrupts speech signals through time, assigning more noise to frames close to the present in a buffer, enabling denoised output with controllable delay.
Result: Outperforms standard diffusion models, runs efficiently on GPU, achieves input-output latency of 0.3-1 seconds, making it the first practical diffusion-based solution for online speech enhancement.
Conclusion: The proposed sliding window approach successfully addresses the computational limitations of diffusion models for real-time applications, providing an effective trade-off between performance and latency for online speech enhancement.
Abstract: Diffusion models are a class of generative models that have been recently used for speech enhancement with remarkable success but are computationally expensive at inference time. Therefore, these models are impractical for processing streaming data in real-time. In this work, we adapt a sliding window diffusion framework to the speech enhancement task. Our approach progressively corrupts speech signals through time, assigning more noise to frames close to the present in a buffer. This approach outputs denoised frames with a delay proportional to the chosen buffer size, enabling a trade-off between performance and latency. Empirical results demonstrate that our method outperforms standard diffusion models and runs efficiently on a GPU, achieving an input-output latency in the order of 0.3 to 1 seconds. This marks the first practical diffusion-based solution for online speech enhancement.
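A conceptual NumPy sketch of the sliding-window noise assignment described above: frames in a buffer carry noise levels that increase toward the present, each streaming step partially denoises every slot, and the oldest (now clean) frame is emitted while a fresh, maximally noised frame enters. The linear level schedule and noise scaling are illustrative assumptions, not the paper’s parameterization.

```python
import numpy as np

B = 8                                   # buffer size ~ output latency in frames
levels = np.arange(1, B + 1)            # noise level per slot; newest frame is noisiest

def buffer_step(buffer, new_frame, denoise, rng=np.random.default_rng(0)):
    """One streaming step: emit the oldest frame, denoise each slot one level."""
    out = buffer[0]                                      # fully denoised frame leaves
    buffer = np.roll(buffer, -1, axis=0)
    buffer[-1] = new_frame + rng.normal(0, levels[-1] * 0.1, new_frame.shape)
    for i in range(B):                                   # partial denoising of every slot
        buffer[i] = denoise(buffer[i], level=levels[i])
    return buffer, out

identity = lambda x, level: x                            # stand-in for the score model
buf = np.zeros((B, 257))
buf, frame = buffer_step(buf, np.random.randn(257), identity)
```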
[323] IS³: Generic Impulsive–Stationary Sound Separation in Acoustic Scenes using Deep Filtering
Clémentine Berger, Paraskevas Stamatiadis, Roland Badeau, Slim Essid
Main category: eess.AS
TL;DR: IS³ neural network separates impulsive acoustic events from stationary backgrounds using deep filtering, outperforming existing methods on objective metrics.
Details
Motivation: Need for audio systems that can differentiate between stationary backgrounds and isolated acoustic events for applications like robust audio rendering, noise suppression, and acoustic event classification.
Method: Deep filtering neural network architecture trained with sophisticated data generation pipeline that curates and adapts existing datasets for impulsive-stationary sound separation.
Result: Outperforms Harmonic-Percussive Sound Separation masking method (adapted from music processing) and wavelet filtering on objective separation metrics.
Conclusion: Learning-based approach with lightweight neural architecture and well-designed training data successfully addresses previously unaddressed task of impulsive-stationary sound separation.
Abstract: We are interested in audio systems capable of performing a differentiated processing of stationary backgrounds and isolated acoustic events within an acoustic scene, whether for applying specific processing methods to each part or for focusing solely on one while ignoring the other. Such systems have applications in real-world scenarios, including robust adaptive audio rendering systems (e.g., EQ or compression), plosive attenuation in voice mixing, noise suppression or reduction, robust acoustic event classification or even bioacoustics. To this end, we introduce IS³, a neural network designed for Impulsive–Stationary Sound Separation, that isolates impulsive acoustic events from the stationary background using a deep filtering approach, that can act as a pre-processing stage for the above-mentioned tasks. To ensure optimal training, we propose a sophisticated data generation pipeline that curates and adapts existing datasets for this task. We demonstrate that a learning-based approach, built on a relatively lightweight neural architecture and trained with well-designed and varied data, is successful in this previously unaddressed task, outperforming the Harmonic–Percussive Sound Separation masking method, adapted from music signal processing research, and wavelet filtering on objective separation metrics.
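A minimal NumPy sketch of the deep-filtering operation at IS³’s core, under the usual formulation: the network predicts a short complex filter per time-frequency bin, applied over a neighborhood of past STFT frames. The filter here is random, standing in for a network output, and the tap count is an assumption.

```python
import numpy as np

def deep_filter(X, H):
    """X: noisy STFT (T, F); H: per-bin complex filters (T, F, N taps over past frames)."""
    T, F, N = H.shape
    Y = np.zeros((T, F), dtype=complex)
    for tau in range(N):
        X_shift = np.roll(X, tau, axis=0)
        X_shift[:tau] = 0                       # no wrap-around into the future
        Y += H[:, :, tau] * X_shift             # filter-and-sum over past frames
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 257)) + 1j * rng.normal(size=(100, 257))
H = 0.1 * (rng.normal(size=(100, 257, 5)) + 1j * rng.normal(size=(100, 257, 5)))
S_hat = deep_filter(X, H)                       # separated (e.g. impulsive) component
```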
eess.IV
[324] Automated Tuning for Diffusion Inverse Problem Solvers without Generative Prior Retraining
Yaşar Utku Alçalar, Junno Yun, Mehmet Akçakaya
Main category: eess.IV
TL;DR: ZADS is a zero-shot adaptive diffusion sampling method that optimizes fidelity weights across arbitrary noise schedules for MRI reconstruction without retraining the diffusion prior.
Details
Motivation: Existing diffusion-based MRI reconstruction methods rely on heuristics or fixed fidelity weights that fail to generalize across varying measurement conditions and irregular timestep schedules.
Method: Treats denoising process as fixed unrolled sampler and optimizes fidelity weights in self-supervised manner using only undersampled measurements, without requiring diffusion prior retraining.
Result: Outperforms both traditional compressed sensing and recent diffusion-based methods on fastMRI knee dataset, delivering high-fidelity reconstructions across varying noise schedules and acquisition settings.
Conclusion: ZADS provides an effective test-time optimization approach that adaptively tunes fidelity weights for improved MRI reconstruction performance without additional training requirements.
Abstract: Diffusion/score-based models have recently emerged as powerful generative priors for solving inverse problems, including accelerated MRI reconstruction. While their flexibility allows decoupling the measurement model from the learned prior, their performance heavily depends on carefully tuned data fidelity weights, especially under fast sampling schedules with few denoising steps. Existing approaches often rely on heuristics or fixed weights, which fail to generalize across varying measurement conditions and irregular timestep schedules. In this work, we propose Zero-shot Adaptive Diffusion Sampling (ZADS), a test-time optimization method that adaptively tunes fidelity weights across arbitrary noise schedules without requiring retraining of the diffusion prior. ZADS treats the denoising process as a fixed unrolled sampler and optimizes fidelity weights in a self-supervised manner using only undersampled measurements. Experiments on the fastMRI knee dataset demonstrate that ZADS consistently outperforms both traditional compressed sensing and recent diffusion-based methods, showcasing its ability to deliver high-fidelity reconstructions across varying noise schedules and acquisition settings.
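A schematic PyTorch sketch of the test-time tuning loop ZADS describes: the unrolled sampler is treated as fixed, per-step fidelity weights are the only learnable parameters, and the loss is self-supervised data consistency on the undersampled measurements. The denoiser, forward operator, and step structure below are placeholders, not the paper’s sampler.

```python
import torch

def unrolled_sampler(y, A, At, denoise, lam):      # lam: per-step fidelity weights
    x = At(y)                                      # zero-filled start
    for t in range(len(lam)):
        x = denoise(x, t)                          # fixed diffusion denoising step
        x = x - lam[t] * At(A(x) - y)              # weighted data-consistency step
    return x

# placeholders: undersampling mask as the forward model, identity "denoiser"
mask = (torch.rand(64) > 0.5).float()
A = lambda x: mask * x
At = lambda y: mask * y
denoise = lambda x, t: x
y = A(torch.randn(64))

lam = torch.nn.Parameter(torch.full((10,), 0.5))
opt = torch.optim.Adam([lam], lr=1e-2)
for _ in range(50):                                # self-supervised tuning at test time
    loss = ((A(unrolled_sampler(y, A, At, denoise, lam)) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```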
[325] Accelerating 3D Photoacoustic Computed Tomography with End-to-End Physics-Aware Neural Operators
Jiayun Wang, Yousuf Aborahama, Arya Khokhar, Yang Zhang, Chuwei Wang, Karteekeya Sastry, Julius Berner, Yilin Luo, Boris Bonev, Zongyi Li, Kamyar Azizzadenesheli, Lihong V. Wang, Anima Anandkumar
Main category: eess.IV
TL;DR: Pano is an end-to-end physics-aware neural operator that reconstructs high-quality 3D photoacoustic images from sparse sensor data, reducing hardware requirements while maintaining image quality.
Details
Motivation: Current 3D PACT systems require dense transducer arrays and long acquisition times, limiting clinical translation. There's a need for methods that can achieve high-quality imaging with reduced hardware complexity.
Method: Pano uses spherical discrete-continuous convolutions to preserve sensor geometry, incorporates Helmholtz equation constraints for physical consistency, and operates resolution-independently across varying sensor configurations as an end-to-end neural operator.
Result: Pano achieves robust high-quality image reconstruction from both simulated and real experimental data, maintaining performance with significantly reduced transducer counts and limited-angle configurations while enabling real-time volumetric imaging.
Conclusion: The framework establishes a practical pathway for making 3D PACT more accessible for preclinical and clinical applications by substantially reducing hardware requirements without compromising reconstruction quality.
Abstract: Photoacoustic computed tomography (PACT) combines optical contrast with ultrasonic resolution, achieving deep-tissue imaging beyond the optical diffusion limit. While three-dimensional PACT systems enable high-resolution volumetric imaging for applications spanning transcranial to breast imaging, current implementations require dense transducer arrays and prolonged acquisition times, limiting clinical translation. We introduce Pano (PACT imaging neural operator), an end-to-end physics-aware model that directly learns the inverse acoustic mapping from sensor measurements to volumetric reconstructions. Unlike existing approaches (e.g. universal back-projection algorithm), Pano learns both physics and data priors while also being agnostic to the input data resolution. Pano employs spherical discrete-continuous convolutions to preserve hemispherical sensor geometry, incorporates Helmholtz equation constraints to ensure physical consistency and operates resolution-independently across varying sensor configurations. We demonstrate the robustness and efficiency of Pano in reconstructing high-quality images from both simulated and real experimental data, achieving consistent performance even with significantly reduced transducer counts and limited-angle acquisition configurations. The framework maintains reconstruction fidelity across diverse sparse sampling patterns while enabling real-time volumetric imaging capabilities. This advancement establishes a practical pathway for making 3D PACT more accessible and feasible for both preclinical research and clinical applications, substantially reducing hardware requirements without compromising image reconstruction quality.
[326] Drone-Based Multispectral Imaging and Deep Learning for Timely Detection of Branched Broomrape in Tomato Farms
Mohammadreza Narimani, Alireza Pourreza, Ali Moghimi, Mohsen Mesgaran, Parastoo Farajpoor, Hamid Jafarbiglu
Main category: eess.IV
TL;DR: Drone-based multispectral imagery combined with LSTM deep learning and SMOTE achieved 88% accuracy in detecting parasitic broomrape in tomato crops, enabling early detection to protect California’s tomato industry.
Details
Motivation: Branched broomrape poses a severe threat to California's tomato industry (90% of US processing tomatoes), with conventional detection methods being difficult and chemical controls being costly, environmentally harmful, and ineffective.
Method: Combined drone-based multispectral imagery with Long Short-Term Memory (LSTM) deep learning networks, using SMOTE to handle class imbalance. Research conducted on infested tomato farm across five growth stages determined by growing degree days (GDD).
Result: At 897 GDD, achieved 79.09% accuracy and 70.36% recall. Best performance with all growth stages and SMOTE augmentation: 88.37% overall accuracy and 95.37% recall.
Conclusion: Temporal multispectral analysis with LSTM networks shows strong potential for early broomrape detection. UAV-based sensing with deep learning could provide powerful precision agriculture tool to reduce losses and improve sustainability in tomato production.
Abstract: This study addresses the escalating threat of branched broomrape (Phelipanche ramosa) to California’s tomato industry, which supplies over 90 percent of U.S. processing tomatoes. The parasite’s largely underground life cycle makes early detection difficult, while conventional chemical controls are costly, environmentally harmful, and often ineffective. To address this, we combined drone-based multispectral imagery with Long Short-Term Memory (LSTM) deep learning networks, using the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance. Research was conducted on a known broomrape-infested tomato farm in Woodland, Yolo County, CA, across five key growth stages determined by growing degree days (GDD). Multispectral images were processed to isolate tomato canopy reflectance. At 897 GDD, broomrape could be detected with 79.09 percent overall accuracy and 70.36 percent recall without integrating later stages. Incorporating sequential growth stages with LSTM improved detection substantially. The best-performing scenario, which integrated all growth stages with SMOTE augmentation, achieved 88.37 percent overall accuracy and 95.37 percent recall. These results demonstrate the strong potential of temporal multispectral analysis and LSTM networks for early broomrape detection. While further real-world data collection is needed for practical deployment, this study shows that UAV-based multispectral sensing coupled with deep learning could provide a powerful precision agriculture tool to reduce losses and improve sustainability in tomato production.
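A condensed sketch of the training recipe described, assuming imbalanced-learn and PyTorch: SMOTE balances the per-pixel class counts, then an LSTM consumes the multi-stage reflectance sequence. Band count, growth-stage count, and network size are placeholders, not the study’s configuration.

```python
import numpy as np
import torch
import torch.nn as nn
from imblearn.over_sampling import SMOTE

n_stages, n_bands = 5, 5                       # growth stages x spectral bands
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, n_stages * n_bands))
y = (rng.random(1000) < 0.05).astype(int)      # rare infested class

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # balance classes
X_seq = torch.tensor(X_res, dtype=torch.float32).reshape(-1, n_stages, n_bands)

class StageLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(n_bands, 32, batch_first=True)
        self.head = nn.Linear(32, 2)
    def forward(self, x):
        _, (h, _) = self.lstm(x)               # final hidden state summarizes the season
        return self.head(h[-1])

logits = StageLSTM()(X_seq[:8])                # per-pixel infestation logits
```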
[327] Polarization Denoising and Demosaicking: Dataset and Baseline Method
Muhamad Daniel Ariff Bin Abdul Rahman, Yusuke Monno, Masayuki Tanaka, Masatoshi Okutomi
Main category: eess.IV
TL;DR: Proposes a dataset and baseline method for joint polarization denoising and demosaicking in division-of-focal-plane polarimeters, addressing a previously understudied problem due to lack of suitable evaluation data.
Details
Motivation: Division-of-focal-plane polarimeters capture multiple polarization orientations in one shot but require both denoising and demosaicking. While polarization demosaicking has been studied for noise-free cases, joint denoising and demosaicking research is scarce due to lack of evaluation datasets and baseline methods.
Method: Proposes a denoising-then-demosaicking approach using well-accepted signal processing components to create a reproducible baseline method. Also introduces a dataset with 40 real-world scenes and three noise-level conditions containing pairs of noisy mosaic inputs and noise-free full images.
Result: Experimental results show the proposed method exhibits higher image reconstruction performance than other alternative methods, providing a solid baseline for future research.
Conclusion: The paper successfully addresses the gap in polarization denoising and demosaicking research by providing both a comprehensive dataset and an effective baseline method that outperforms existing alternatives.
Abstract: A division-of-focal-plane (DoFP) polarimeter enables us to acquire images with multiple polarization orientations in one shot and thus it is valuable for many applications using polarimetric information. The image processing pipeline for a DoFP polarimeter entails two crucial tasks: denoising and demosaicking. While polarization demosaicking for a noise-free case has increasingly been studied, the research for the joint task of polarization denoising and demosaicking is scarce due to the lack of a suitable evaluation dataset and a solid baseline method. In this paper, we propose a novel dataset and method for polarization denoising and demosaicking. Our dataset contains 40 real-world scenes and three noise-level conditions, consisting of pairs of noisy mosaic inputs and noise-free full images. Our method takes a denoising-then-demosaicking approach based on well-accepted signal processing components to offer a reproducible method. Experimental results demonstrate that our method exhibits higher image reconstruction performance than other alternative methods, offering a solid baseline.
[328] Soft Tissue Simulation and Force Estimation from Heterogeneous Structures using Equivariant Graph Neural Networks
Madina Kojanazarova, Sidady El Hadramy, Jack Wilkie, Georg Rauter, Philippe C. Cattin
Main category: eess.IV
TL;DR: A graph neural network (GNN) using E(n)-equivariant message passing predicts soft tissue deformation and force from sparse point clouds, achieving real-time performance comparable to finite element methods with better generalization to rotated and cross-resolution scenarios.
Details
Motivation: Physics-based models like FEM provide high-fidelity soft tissue simulation but are computationally expensive and require extensive preprocessing, making them unsuitable for real-time surgical applications.
Method: Proposed a GNN architecture that incorporates internal anatomical information through binary tissue profiles and uses E(n)-equivariant message passing for robustness. Trained on both experimental data (silicone/bone phantom) and synthetic FEM simulations.
Result: Achieves comparable performance to baseline GNN, significantly outperforms in rotated and cross-resolution scenarios, shows strong generalization to unseen orientations and point densities, and provides significant speed improvement for real-time applications. Maintains sub-millimeter accuracy when fine-tuned on experimental data.
Conclusion: The approach offers an efficient, data-driven alternative to traditional simulations that can generalize across anatomical configurations and support interactive surgical environments in real-time.
Abstract: Accurately simulating soft tissue deformation is crucial for surgical training, pre-operative planning, and real-time haptic feedback systems. While physics-based models such as the finite element method (FEM) provide high-fidelity results, they are often computationally expensive and require extensive preprocessing. We propose a graph neural network (GNN) architecture that predicts both tissue surface deformation and applied force from sparse point clouds. The model incorporates internal anatomical information through binary tissue profiles beneath each point and leverages E(n)-equivariant message passing to improve robustness. We collected experimental data that comprises a real silicone and bone-like phantom, and complemented it with synthetic simulations generated using FEM. Our model achieves a comparable performance to a baseline GNN on standard test cases and significantly outperforms it in rotated and cross-resolution scenarios, showing a strong generalization to unseen orientations and point densities. It also achieves a significant speed improvement, offering a solution for real-time applications. When fine-tuned on experimental data, the model maintains sub-millimeter deformation accuracy despite limited sample size and measurement noise. The results demonstrate that our approach offers an efficient, data-driven alternative to traditional simulations, capable of generalizing across anatomical configurations and supporting interactive surgical environments.
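A compact PyTorch sketch of one E(n)-equivariant message-passing layer of the kind the model leverages, following the standard EGNN update: messages from invariant features plus squared distances, coordinate updates along relative positions. This is the generic layer, not the paper’s full deformation-and-force network.

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    def __init__(self, h_dim=32, m_dim=32):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(2 * h_dim + 1, m_dim), nn.SiLU())
        self.phi_x = nn.Linear(m_dim, 1, bias=False)
        self.phi_h = nn.Sequential(nn.Linear(h_dim + m_dim, h_dim), nn.SiLU())

    def forward(self, h, x, edges):               # h: (N, h_dim), x: (N, 3)
        src, dst = edges                          # (E,), (E,)
        d2 = ((x[src] - x[dst]) ** 2).sum(-1, keepdim=True)
        m = self.phi_e(torch.cat([h[src], h[dst], d2], dim=-1))
        # coordinate update along relative positions keeps E(n) equivariance
        dx = torch.zeros_like(x).index_add_(0, dst, (x[dst] - x[src]) * self.phi_x(m))
        agg = torch.zeros(h.size(0), m.size(-1)).index_add_(0, dst, m)
        return self.phi_h(torch.cat([h, agg], dim=-1)), x + dx

h, x = torch.randn(10, 32), torch.randn(10, 3)
edges = torch.randint(0, 10, (2, 40))             # random point-cloud graph
h2, x2 = EGNNLayer()(h, x, (edges[0], edges[1]))
```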
[329] Multi-pathology Chest X-ray Classification with Rejection Mechanisms
Yehudit Aperstein, Amit Tzahar, Alon Gottlib, Tal Verber, Ravit Shagan Damti, Alexander Apartsin
Main category: eess.IV
TL;DR: Uncertainty-aware DenseNet-121 framework for chest X-ray diagnosis using entropy-based and confidence interval-based rejection to abstain from uncertain predictions, improving reliability in multi-label classification.
Details
Motivation: Address overconfidence risks in deep learning models for high-stakes medical imaging tasks, particularly multi-label chest X-ray classification where multiple co-occurring pathologies must be detected simultaneously.
Method: DenseNet-121 backbone enhanced with two selective prediction mechanisms (entropy-based rejection and confidence interval-based rejection) and quantile-based calibration for threshold tuning using global or class-specific strategies.
Result: Experiments on three large datasets (PadChest, NIH ChestX-ray14, MIMIC-CXR) show selective rejection improves accuracy-coverage trade-off, with entropy-based rejection achieving highest average AUC across all pathologies.
Conclusion: Supports integration of selective prediction into AI-assisted diagnostic workflows for safer, uncertainty-aware deployment of deep learning in clinical settings.
Abstract: Overconfidence in deep learning models poses a significant risk in high-stakes medical imaging tasks, particularly in multi-label classification of chest X-rays, where multiple co-occurring pathologies must be detected simultaneously. This study introduces an uncertainty-aware framework for chest X-ray diagnosis based on a DenseNet-121 backbone, enhanced with two selective prediction mechanisms: entropy-based rejection and confidence interval-based rejection. Both methods enable the model to abstain from uncertain predictions, improving reliability by deferring ambiguous cases to clinical experts. A quantile-based calibration procedure is employed to tune rejection thresholds using either global or class-specific strategies. Experiments conducted on three large public datasets (PadChest, NIH ChestX-ray14, and MIMIC-CXR) demonstrate that selective rejection improves the trade-off between diagnostic accuracy and coverage, with entropy-based rejection yielding the highest average AUC across all pathologies. These results support the integration of selective prediction into AI-assisted diagnostic workflows, providing a practical step toward safer, uncertainty-aware deployment of deep learning in clinical settings.
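A small NumPy sketch of the entropy-based rejection with quantile calibration described above, for one label of a multi-label classifier: per-prediction binary entropy is compared against a threshold set as a quantile of entropies on a calibration set. The quantile level is an assumed operating point, not the paper’s calibrated value.

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def calibrate_threshold(cal_probs, coverage=0.9):
    # keep the `coverage` fraction of least-uncertain calibration predictions
    return np.quantile(binary_entropy(cal_probs), coverage)

rng = np.random.default_rng(0)
cal = rng.beta(0.5, 0.5, 5000)               # calibration sigmoid outputs (one pathology)
tau = calibrate_threshold(cal, coverage=0.9)

test = rng.beta(0.5, 0.5, 10)
abstain = binary_entropy(test) > tau          # defer uncertain cases to a clinician
```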
[330] Human Body Segment Volume Estimation with Two RGB-D Cameras
Giulia Bassani, Emilio Maoddi, Usman Asghar, Carlo Alberto Avizzano, Alessandro Filippeschi
Main category: eess.IV
TL;DR: A system using two RGB-D cameras to estimate human body segment volumes with accuracy comparable to 3D laser scanners, using enhanced ARAP registration techniques.
Details
Motivation: Accurate body volume estimation is crucial for health assessment, ergonomic design, and biomechanical modeling, but existing systems are often complex and expensive.
Method: Body Segment Volume Estimation (BSV) system using two RGB-D cameras with enhanced As-Rigid-As-Possible (ARAP) non-rigid registration that disconnects energy from single triangle mesh to improve geometrical coherence.
Result: Superior accuracy in human body volume estimation compared to state-of-the-art methods, capable of evaluating volume ratios between body segments for clinical applications.
Conclusion: The BSV system provides accurate body volume measurements with reduced complexity using only two cameras, making it practical for various applications including clinical use.
Abstract: In the field of human biometry, accurately estimating the volume of the whole body and its individual segments is of fundamental importance. Such measurements support a wide range of applications that include assessing health, optimizing ergonomic design, and customizing biomechanical models. In this work, we present a Body Segment Volume Estimation (BSV) system to automatically compute whole-body and segment volumes using only two RGB-D cameras, thus limiting the system complexity. However, to maintain accuracy comparable to 3D laser scanners, we enhanced the As-Rigid-As-Possible (ARAP) non-rigid registration technique, disconnecting its energy from the single triangle mesh. Thus, we improved the geometrical coherence of the reconstructed mesh, especially in the lateral gap areas. We evaluated BSV at several levels: the RGB-D cameras' performance, the results obtained with FAUST dataset human body models in comparison with a state-of-the-art work, and real acquisitions. It showed superior ability in accurately estimating human body volumes, and it allows evaluating volume ratios between proximal and distal body segments, which are useful indices in many clinical applications.
[331] PL-Net: Progressive Learning Network for Medical Image Segmentation
Kunpeng Mao, Ruoyu Li, Junlong Cheng, Danmei Huang, Zhiping Song, ZeKui Liu
Main category: eess.IV
TL;DR: PL-Net is a 2D medical image segmentation framework that uses progressive learning (IPL and EPL) to better fuse coarse and fine-grained semantic information without adding parameters, achieving competitive performance on medical datasets.
Details
Motivation: Existing deep learning segmentation methods focus on optimizing U-Net structure or adding modules, but overlook the complementation and fusion of coarse-grained and fine-grained semantic information.
Method: Proposed PL-Net with Internal Progressive Learning (IPL) for mixing different receptive fields and External Progressive Learning (EPL) for two-stage training to optimize coarse and fine-grained information fusion.
Result: Comprehensive evaluations on five medical image segmentation datasets show PL-Net achieves competitive segmentation performance without introducing additional learnable parameters.
Conclusion: PL-Net effectively addresses the fusion of coarse and fine-grained semantic information in medical image segmentation while maintaining parameter efficiency compared to other U-Net variants.
Abstract: In recent years, deep convolutional neural network-based segmentation methods have achieved state-of-the-art performance for many medical analysis tasks. However, most of these approaches rely on optimizing the U-Net structure or adding new functional modules, which overlooks the complementation and fusion of coarse-grained and fine-grained semantic information. To address these issues, we propose a 2D medical image segmentation framework called Progressive Learning Network (PL-Net), which comprises Internal Progressive Learning (IPL) and External Progressive Learning (EPL). PL-Net offers the following advantages: (1) IPL divides feature extraction into two steps, allowing for the mixing of different size receptive fields and capturing semantic information from coarse to fine granularity without introducing additional parameters; (2) EPL divides the training process into two stages to optimize parameters and facilitate the fusion of coarse-grained information in the first stage and fine-grained information in the second stage. We conducted comprehensive evaluations of our proposed method on five medical image segmentation datasets, and the experimental results demonstrate that PL-Net achieves competitive segmentation performance. It is worth noting that PL-Net does not introduce any additional learnable parameters compared to other U-Net variants.
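As a rough illustration of how coarse-to-fine mixing can come for free parameter-wise, the sketch below reuses a single convolution twice, so the second pass sees a larger effective receptive field. This is a hypothetical stand-in, not the paper's actual IPL block; the class name `TwoStepBlock` and its layer sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class TwoStepBlock(nn.Module):
    """Hypothetical sketch (not the paper's IPL block): applying the same
    convolution twice mixes a small and a larger effective receptive field
    without introducing any additional learnable parameters."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fine = self.act(self.conv(x))       # 3x3 effective receptive field
        coarse = self.act(self.conv(fine))  # ~5x5, reusing the same weights
        return fine + coarse                # fuse coarse and fine granularity
```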
[332] Integrative Variational Autoencoders for Generative Modeling of an Image Outcome with Multiple Input Images
Bowen Lei, Yeseul Jeon, Rajarshi Guhaniyogi, Aaron Scheffler, Bani Mallick, Alzheimer’s Disease Neuroimaging Initiatives
Main category: eess.IV
TL;DR: InVA is a hierarchical VAE framework for multimodal neuroimaging that predicts outcome images using both shared and modality-specific features, outperforming traditional methods.
Details
Motivation: To address the need for better integration across multiple neuroimaging modalities and overcome limitations of standard VAEs, which are not designed for predictive cross-modal integration.
Method: Integrative Variational Autoencoder (InVA) - a hierarchical VAE framework that models outcome images as functions of both shared and modality-specific features in a flexible, data-driven approach.
Result: InVA outperforms conventional VAEs and nonlinear models like BART, and can accurately predict costly PET scans from structural MRI data.
Conclusion: InVA provides an efficient and powerful tool for multimodal neuroimaging research by enabling better cross-modal prediction without rigid assumptions of classical methods.
Abstract: Understanding relationships across multiple imaging modalities is central to neuroimaging research. We introduce the Integrative Variational Autoencoder (InVA), the first hierarchical VAE framework for image-on-image regression in multimodal neuroimaging. Unlike standard VAEs, which are not designed for predictive integration across modalities, InVA models outcome images as functions of both shared and modality-specific features. This flexible, data-driven approach avoids rigid assumptions of classical tensor regression and outperforms conventional VAEs and nonlinear models such as BART. As a key application, InVA accurately predicts costly PET scans from structural MRI, offering an efficient and powerful tool for multimodal neuroimaging.
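The core structural idea, decoding the outcome image from a pooled shared code plus per-modality codes, can be sketched in a few lines. Everything below (the class name `InVASketch`, layer sizes, averaging as the pooling rule) is an assumption for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class InVASketch(nn.Module):
    """Hedged sketch of a VAE with shared and modality-specific latents,
    in the spirit of (not identical to) InVA. Two input images are encoded;
    the outcome image is decoded from [shared, specific_a, specific_b]."""
    def __init__(self, dim=4096, z_shared=32, z_spec=16):
        super().__init__()
        self.enc_a = nn.Linear(dim, 2 * (z_shared + z_spec))  # mu, logvar
        self.enc_b = nn.Linear(dim, 2 * (z_shared + z_spec))
        self.dec = nn.Sequential(nn.Linear(z_shared + 2 * z_spec, 256),
                                 nn.ReLU(), nn.Linear(256, dim))
        self.z_shared, self.z_spec = z_shared, z_spec

    @staticmethod
    def reparam(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, xa, xb):
        mua, lva = self.enc_a(xa).chunk(2, dim=-1)
        mub, lvb = self.enc_b(xb).chunk(2, dim=-1)
        za, zb = self.reparam(mua, lva), self.reparam(mub, lvb)
        sa, pa = za.split([self.z_shared, self.z_spec], dim=-1)
        sb, pb = zb.split([self.z_shared, self.z_spec], dim=-1)
        z_shared = 0.5 * (sa + sb)  # pool the shared code across modalities
        return self.dec(torch.cat([z_shared, pa, pb], dim=-1))
```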
[333] Generalized Ray Tracing with Basis functions for Tomographic Projections
Youssef Haouchat, Sepand Kashani, Philippe Thévenaz, Michael Unser
Main category: eess.IV
TL;DR: Efficient computation of x-ray projections using adapted ray tracing for images represented as linear combinations of overlapping basis functions, particularly splines, with validation in image reconstruction inverse problems.
Details
Motivation: To achieve precise and efficient computation of x-ray projections for images represented by overlapping basis functions, which is crucial for high-quality image reconstruction in inverse problems.
Method: Adaptation of ray tracing technique to compute line integrals for images expressed as linear combinations of general shifted basis functions, particularly focusing on spline representations and supporting arbitrary projection geometries.
Result: The proposed implementation successfully computes forward and backward operators over arbitrary lines, enabling precise x-ray projection calculations for various image reconstruction scenarios.
Conclusion: The adapted ray tracing method provides an efficient and precise solution for x-ray projection computation, maximizing image quality for given reconstruction grid resolution in inverse problems.
Abstract: This work aims at the precise and efficient computation of the x-ray projection of an image represented by a linear combination of general shifted basis functions that typically overlap. We achieve this with a suitable adaptation of ray tracing, which is one of the most efficient methods to compute line integrals. In our work, the cases in which the image is expressed as a spline are of particular relevance. The proposed implementation is applicable to any projection geometry as it computes the forward and backward operators over a collection of arbitrary lines. We validate our work with experiments in the context of inverse problems for image reconstruction and maximize the image quality for a given resolution of the reconstruction grid.
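The underlying identity is that the projection of f(x) = Σ_k c_k φ(x − x_k) along a ray equals the same weighted sum of per-basis line integrals. The sketch below verifies that identity by brute-force quadrature; the paper's contribution is computing these integrals efficiently via ray tracing (visiting only the bases whose support the ray crosses), which this toy version does not attempt. Function names and the example basis are illustrative assumptions.

```python
import numpy as np

def xray_projection(coeffs, centers, phi, origin, direction, t_max, n_quad=512):
    """Approximates the line integral of f(x) = sum_k coeffs[k] * phi(x - centers[k])
    along the ray r(t) = origin + t * direction, for 0 <= t <= t_max.
    Illustrative quadrature only; a real ray tracer would use an analytic
    per-basis footprint and visit only the bases the ray intersects."""
    direction = np.asarray(direction, float)
    direction /= np.linalg.norm(direction)  # unit-speed parametrization
    t = np.linspace(0.0, t_max, n_quad)
    pts = np.asarray(origin, float) + t[:, None] * direction  # (n_quad, 2)
    total = 0.0
    for c, x_k in zip(coeffs, centers):
        total += c * np.trapz(phi(pts - x_k), t)  # per-basis line integral
    return total

def phi(p):
    """Example basis (assumed): separable linear B-spline, support [-1, 1]^2."""
    b = lambda u: np.maximum(1.0 - np.abs(u), 0.0)
    return b(p[..., 0]) * b(p[..., 1])
```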
[334] Quanta Diffusion
Prateek Chennuri, Dongdong Fu, Stanley H. Chan
Main category: eess.IV
TL;DR: QuDi is a diffusion-based generative method for video reconstruction from single-photon sensors that handles motion and shot noise better than existing approaches, achieving 2.4 dB PSNR improvement.
Details
Motivation: Existing methods struggle with simultaneously managing motion and strong shot noise in extremely low-light imaging conditions using Quanta Image Sensors (QIS) and Single Photon Avalanche Diodes (SPADs).
Method: QuDi injects a physics-based forward model into the diffusion algorithm while keeping motion estimation in the loop, creating a powerful generative video reconstruction framework.
Result: The method demonstrates an average of 2.4 dB PSNR improvement over the best existing methods for single-photon imaging.
Conclusion: QuDi successfully overcomes the challenges of motion and shot noise management in low-light video reconstruction, providing superior performance for quantum image sensor applications.
Abstract: We present Quanta Diffusion (QuDi), a powerful generative video reconstruction method for single-photon imaging. QuDi is an algorithm supporting the latest Quanta Image Sensors (QIS) and Single Photon Avalanche Diodes (SPADs) for extremely low-light imaging conditions. Compared to existing methods, QuDi overcomes the difficulties of simultaneously managing the motion and the strong shot noise. The core innovation of QuDi is to inject a physics-based forward model into the diffusion algorithm, while keeping the motion estimation in the loop. QuDi demonstrates an average of 2.4 dB PSNR improvement over the best existing methods.
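In generic form, "injecting a physics-based forward model into the diffusion algorithm" amounts to a data-consistency correction applied during reverse sampling. The sketch below shows one common pattern (gradient of a measurement misfit through an assumed differentiable forward model); it is not QuDi's actual update, and all callables and signatures are assumptions.

```python
import torch

def guided_reverse_step(x_t, t, denoiser, forward_model, y, step_size=0.5):
    """Hedged sketch of one reverse-diffusion step with a physics-based
    data-consistency correction. `denoiser(x_t, t)` predicts a clean frame;
    `forward_model` maps a clean frame to expected photon measurements;
    `y` holds the observed single-photon data. Signatures are assumptions."""
    x0_hat = denoiser(x_t, t)                            # network prediction
    x0_hat = x0_hat.detach().requires_grad_(True)
    loss = torch.sum((forward_model(x0_hat) - y) ** 2)   # measurement misfit
    grad, = torch.autograd.grad(loss, x0_hat)
    x0_hat = (x0_hat - step_size * grad).detach()        # pull toward the data
    return x0_hat  # a noise scheduler would then re-noise this to level t-1
```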
[335] Hadamard Encoded Row Column Ultrasonic Expansive Scanning (HERCULES) with Bias-Switchable Row-Column Arrays
Darren Olufemi Dahunsi, Randy Palmar, Tyler Henry, Mohammad Rahim Sobhani, Negar Majidi, Joy Wang, Afshin Kashani Ilkhechi, Jeremy Brown, Roger Zemp
Main category: eess.IV
TL;DR: HERCULES is a novel ultrasound imaging technique using TOBE arrays that enables expansive 3D scanning beyond the array’s physical aperture, achieving high-volume imaging rates comparable to traditional methods.
Details
Motivation: To overcome the limitations of non-bias-switchable row-column arrays (RCAs) and enable imaging beyond the array's shadow, potentially allowing whole organ imaging and 3D tissue visualization through limited windows.
Method: Uses Hadamard-Encoded-Read-Out (HERO) beamforming on a full 2D synthetic receive aperture, transmitting plane or cylindrical wavefronts. Implemented with custom TOBE array, biasing electronics, and research ultrasound system.
Result: Demonstrated comparable resolution to existing RCA methods at tens to hundreds of volumes per second. Validated with commercial phantom imaging and xenograft mouse model tissue imaging.
Conclusion: HERCULES successfully enables expansive 3D ultrasound imaging beyond traditional RCA limitations, with potential applications in whole organ imaging and tissue morphology visualization.
Abstract: Top-Orthogonal-to-Bottom-Electrode (TOBE) arrays, also known as bias-switchable row-column arrays (RCAs), allow for imaging techniques otherwise impossible with non-bias-switchable RCAs. Hadamard Encoded Row Column Ultrasonic Expansive Scanning (HERCULES) is a novel imaging technique that allows for expansive 3D scanning by transmitting plane or cylindrical wavefronts and receiving with Hadamard-Encoded-Read-Out (HERO), performing beamforming on what is effectively a full 2D synthetic receive aperture. This allows imaging beyond the shadow of the RCA array's aperture, potentially enabling whole-organ imaging and 3D visualization of tissue morphology, and additionally enables viewing large volumes through limited windows. In this work, we demonstrated in simulation that HERCULES images at resolution comparable to existing RCA imaging methods at tens to hundreds of volumes per second. We validated these simulations with an experimental implementation of HERCULES using a custom-fabricated TOBE array, custom biasing electronics, and a research ultrasound system. Furthermore, we assessed our imaging capabilities by imaging a commercial phantom and comparing our results to those obtained with traditional RCA imaging methods. Finally, we verified our ability to image real tissue by imaging a xenograft mouse model.
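The decode step behind Hadamard-encoded readout is linear-algebraically simple: if each acquisition applies one ±1 row-bias pattern taken from a Hadamard matrix H, the per-row signals are recovered exactly by applying Hᵀ/N, while every row element contributes to every acquisition. A toy numerical check (array size and signals invented for illustration; this is not the authors' implementation):

```python
import numpy as np
from scipy.linalg import hadamard

N = 8                               # number of bias-switchable rows (power of two)
H = hadamard(N)                     # +/-1 encoding matrix; H @ H.T == N * I
signals = np.random.randn(N, 1000)  # toy per-row RF signals, shape (row, time)
encoded = H @ signals               # N acquisitions, one bias pattern each
decoded = (H.T @ encoded) / N       # Hadamard decode recovers per-row data
assert np.allclose(decoded, signals)
```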
[336] Uncovering Neuroimaging Biomarkers of Brain Tumor Surgery with AI-Driven Methods
Carmen Jimenez-Mesa, Yizhou Wan, Guilio Sansone, Francisco J. Martinez-Murcia, Javier Ramirez, Pietro Lio, Juan M. Gorriz, Stephen J. Price, John Suckling, Michail Mamalakis
Main category: eess.IV
TL;DR: Novel XAI framework for brain tumor survival prediction using pre/post-surgery MRI data from 49 patients, with a global explanation optimizer that improves interpretability and identifies survival-related biomarkers in cognitive/sensory regions.
Details
Motivation: Predicting outcomes of brain tumor resection is crucial, but progress is limited by the rarity of curated datasets with both pre- and post-surgery imaging; clinical, logistical and ethical challenges make such data collection difficult.
Method: Developed an explainable AI framework integrating XAI with neuroimaging feature engineering. Created a global explanation optimizer to refine survival-related feature attribution in deep learning models, using structural MRI from 49 patients scanned pre- and post-surgery.
Result: Found that survival after oncological surgery is influenced by alterations in regions related to cognitive and sensory functions. Optimizer enhanced both fidelity and comprehensibility of model explanations beyond state-of-the-art XAI methods.
Conclusion: XAI-driven neuroimaging analysis identifies survival-related variability and has potential to inform precision medicine strategies in brain tumor treatment, highlighting importance of preserving decision-making and emotional regulation areas.
Abstract: Brain tumor resection is a highly complex procedure with profound implications for survival and quality of life. Predicting patient outcomes is crucial to guide clinicians in balancing oncological control with preservation of neurological function. However, building reliable prediction models is severely limited by the rarity of curated datasets that include both pre- and post-surgery imaging, given the clinical, logistical and ethical challenges of collecting such data. In this study, we develop a novel framework that integrates explainable artificial intelligence (XAI) with neuroimaging-based feature engineering for survival assessment in brain tumor patients. We curated structural MRI data from 49 patients scanned pre- and post-surgery, providing a rare resource for identifying survival-related biomarkers. A key methodological contribution is the development of a global explanation optimizer, which refines survival-related feature attribution in deep learning models, thereby improving both the interpretability and reliability of predictions. From a clinical perspective, our findings provide important evidence that survival after oncological surgery is influenced by alterations in regions related to cognitive and sensory functions. These results highlight the importance of preserving areas involved in decision-making and emotional regulation to improve long-term outcomes. From a technical perspective, the proposed optimizer advances beyond state-of-the-art XAI methods by enhancing both the fidelity and comprehensibility of model explanations, thus reinforcing trust in the recognition patterns driving survival prediction. This work demonstrates the utility of XAI-driven neuroimaging analysis in identifying survival-related variability and underscores its potential to inform precision medicine strategies in brain tumor treatment.
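The paper's global explanation optimizer is not specified in this summary, but the "fidelity" it reports improving is commonly measured with deletion-style tests: zero out the most-attributed features and watch how quickly the prediction degrades. A generic sketch of such a test follows; all names and signatures are assumptions, not taken from the paper.

```python
import numpy as np

def deletion_fidelity(model_predict, x, attribution, n_steps=20):
    """Hedged sketch of a standard deletion-style fidelity check for a
    feature attribution map (not the paper's optimizer): progressively zero
    the most-attributed voxels and record how fast the model's prediction
    drops; a faster drop indicates a more faithful attribution.
    `model_predict(x) -> float` is an assumed callable."""
    order = np.argsort(attribution.ravel())[::-1]  # most important first
    x_flat = x.ravel().copy()
    scores = [model_predict(x_flat.reshape(x.shape))]
    chunk = max(1, len(order) // n_steps)
    for i in range(0, len(order), chunk):
        x_flat[order[i:i + chunk]] = 0.0           # delete the top features
        scores.append(model_predict(x_flat.reshape(x.shape)))
    return np.asarray(scores)  # summarize e.g. by the area under this curve
```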