Daily arXiv Papers - 2025-11-07

AI-enhanced summaries of 13 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs

Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem, Mehwish Nasim

Main category: cs.CL

TL;DR: The paper proposes a pipeline to extract and control personality traits in LLMs using Big Five personality framework, enabling precise behavioral steering without compromising model capabilities.

Motivation: Current LLMs exhibit implicit personalities but lack reliable control mechanisms. There's a need for effective behavioral manipulation and better understanding of psychological constructs in LLM representations.

Method: A novel pipeline that extracts hidden state activations using Big Five traits, applies low-rank subspace discovery, identifies optimal layers across architectures, and implements dynamic layer selection for flexible steering.
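The core steps above (extracting a trait direction from activations and steering by a careful additive perturbation) can be sketched roughly as follows. This is a toy illustration with synthetic data, not the paper's implementation; all names and numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states (n_samples x hidden_dim) for prompts expressing high vs.
# low Extraversion; in practice these come from a chosen transformer layer.
high = rng.normal(0.5, 1.0, size=(64, 32))
low = rng.normal(-0.5, 1.0, size=(64, 32))

# Difference-of-means gives a candidate trait direction; an SVD/PCA over many
# such difference vectors would expose the low-rank shared subspace.
diffs = high - low
trait_direction = diffs.mean(axis=0)
trait_direction /= np.linalg.norm(trait_direction)

def steer(hidden_state: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Careful perturbation: add a scaled trait direction to a hidden state."""
    return hidden_state + alpha * trait_direction

h = rng.normal(size=32)
h_steered = steer(h)
# The projection onto the trait direction shifts by exactly alpha.
proj_before = float(h @ trait_direction)
proj_after = float(h_steered @ trait_direction)
print(round(proj_after - proj_before, 6))  # → 4.0
```

In a real pipeline the steered hidden state would be written back into the chosen layer during generation, with the layer picked per trait via the paper's dynamic layer selection.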

Result: Personality traits occupy a low-rank shared subspace, and these latent structures can be transformed into actionable steering mechanisms through careful perturbations without affecting fluency, variance, or general capabilities.

Conclusion: The approach bridges psychological theory with practical model alignment, enabling precise control of trait expression in LLM outputs through personality-aware steering mechanisms.

Abstract: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. The need for effective mechanisms for behavioural manipulation of the model during generation is a critical gap in the literature that needs to be filled. Personality-aware LLMs hold a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored and requires further investigation. Moreover, it is intriguing to understand and study the use of these representations to steer the models’ behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), which is a comprehensive and empirically validated framework to model human personality; applies low-rank subspace discovery methods; and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting the fluency, variance and general capabilities, helping to bridge the gap between psychological theory and practical model alignment.

[2] TextualVerifier: Verify TextGrad Step-by-Step

Eugenius Mario Situmorang, Adila Alfa Krisnadhi, Ari Wibisono

Main category: cs.CL

TL;DR: TextualVerifier is a self-verification framework that addresses the verification gap in TextGrad by using chain-of-thought reasoning and majority voting with LLMs, improving reasoning validity and optimization results.

Motivation: TextGrad lacks self-verification mechanisms for ensuring reasoning validity in text-based decision making, creating a need for verification frameworks in text-based optimization systems.

Method: A four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. Integrates with TextGrad at loss function and optimization result verification stages using LLMs.
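The four-stage workflow above can be wired up as a small sketch. The LLM call is stubbed with canned verdicts; `call_llm`, `decompose`, and the verdict strings are illustrative stand-ins, not the paper's API.

```python
from collections import Counter

def call_llm(prompt: str, variant: int) -> str:
    # Stub: a real system queries an LLM per variant; here variant 2 disagrees.
    canned = {0: "valid", 1: "valid", 2: "invalid"}
    return canned[variant]

def decompose(solution: str) -> list[str]:
    # Stage 1: split a chain-of-thought solution into individual steps.
    return [s.strip() for s in solution.split("\n") if s.strip()]

def verify_step(step: str, n_variants: int = 3) -> str:
    # Stages 2-3: generate n verdict variants, then majority-vote.
    votes = Counter(call_llm(f"Is this step valid? {step}", v)
                    for v in range(n_variants))
    return votes.most_common(1)[0][0]

def verify(solution: str) -> dict:
    # Stage 4: aggregate per-step verdicts into a consensus report.
    verdicts = [verify_step(s) for s in decompose(solution)]
    return {"steps": len(verdicts),
            "all_valid": all(v == "valid" for v in verdicts)}

report = verify("x = 2 + 2\nso x = 4")
print(report)  # → {'steps': 2, 'all_valid': True}
```

In the paper this loop is attached non-invasively at TextGrad's loss-function and optimization-result stages; the ~5.9 extra LLM calls come from the variant generation and voting.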

Result: Significant improvements: a 29% increase in reasoning step validity, a 2.2 percentage point gain (68.2% to 70.4%) when integrated into the TextGrad loss function, and 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively, with a moderate overhead of 5.9 LLM calls on average.

Conclusion: TextualVerifier is the first self-verification framework for TextGrad that enables more reliable reasoning without requiring numerical gradients, opening new directions for verification in text-based optimization.

Abstract: TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at both the loss function and optimization result verification stages. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into TextGrad loss function yields a 2.2 percentage point gain from 68.2 to 70.4 percent with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier versioning yield 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.

[3] GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki

Main category: cs.CL

TL;DR: Extended Greek Dialectal Dataset (GRDD+) with 6.4M words covering 10 Greek varieties, used to fine-tune LLMs and compare with frontier models.

Motivation: To create the first large-scale dataset with extensive Greek dialectal variation and study the impact of quality dialectal data on language models.

Method: Extended existing GRDD dataset with more Cretan, Cypriot, Pontic, Northern Greek data and added six new varieties. Fine-tuned three 8B-parameter models (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compared with frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).

Result: Created GRDD+ dataset with 6,374,939 words covering 10 Greek varieties - the largest and most varied Greek dialectal dataset to date. Fine-tuning experiments conducted to evaluate dialectal data impact.

Conclusion: The study presents the first comprehensive Greek dialectal dataset of this scale and demonstrates its utility for improving LLM performance on dialectal tasks through fine-tuning experiments.

Abstract: We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).

[4] PLLuM: A Family of Polish Large Language Models

Jan Kocoń, Maciej Piasecki, Arkadiusz Janz, Teddy Ferdinan, Łukasz Radliński, Bartłomiej Koptyra, Marcin Oleksy, Stanisław Woźniak, Paweł Walkowiak, Konrad Wojtasik, Julia Moska, Tomasz Naskręt, Bartosz Walkowiak, Mateusz Gniewkowski, Kamil Szyc, Dawid Motyka, Dawid Banach, Jonatan Dalasiński, Ewa Rudnicka, Bartłomiej Alberski, Tomasz Walkowiak, Aleksander Szczęsny, Maciej Markiewicz, Tomasz Bernaś, Hubert Mazur, Kamil Żyta, Mateusz Tykierko, Grzegorz Chodak, Tomasz Kajdanowicz, Przemysław Kazienko, Agnieszka Karlińska, Karolina Seweryn, Anna Kołos, Maciej Chrabąszcz, Katarzyna Lorenc, Aleksandra Krasnodębska, Artur Wilczek, Katarzyna Dziewulska, Paula Betscher, Zofia Cieślińska, Katarzyna Kowol, Daria Mikoś, Maciej Trzciński, Dawid Krutul, Marek Kozłowski, Sławomir Dadas, Rafał Poświata, Michał Perełkiewicz, Małgorzata Grębowiec, Maciej Kazuła, Marcin Białas, Roman Roszko, Danuta Roszko, Jurgita Vaičenonienė, Andrius Utka, Paweł Levchuk, Paweł Kowalski, Irena Prawdzic-Jankowska, Maciej Ogrodniczuk, Monika Borys, Anna Bulińska, Wiktoria Gumienna, Witold Kieraś, Dorota Komosińska, Katarzyna Krasnowska-Kieraś, Łukasz Kobyliński, Martyna Lewandowska, Marek Łaziński, Mikołaj Łątkowski, Dawid Mastalerz, Beata Milewicz, Agnieszka Anna Mykowiecka, Angelika Peljak-Łapińska, Sandra Penno, Zuzanna Przybysz, Michał Rudolf, Piotr Rybak, Karolina Saputa, Aleksandra Tomaszewska, Aleksander Wawer, Marcin Woliński, Joanna Wołoszyn, Alina Wróblewska, Bartosz Żuk, Filip Żarnecki, Konrad Kaczyński, Anna Cichosz, Zuzanna Deckert, Monika Garnys, Izabela Grabarczyk, Wojciech Janowski, Sylwia Karasińska, Aleksandra Kujawiak, Piotr Misztela, Maria Szymańska, Karolina Walkusz, Igor Siek, Jakub Kwiatkowski, Piotr Pęzik

Main category: cs.CL

TL;DR: PLLuM is the largest open-source family of Polish language foundation models developed by Polish research institutions to address the English-centric bias in LLMs and provide culturally relevant AI for Poland.

Motivation: Current LLM development is primarily English-focused, resulting in limited support for other languages like Polish, creating a need for high-quality, transparent, and culturally relevant language models beyond commercial English-centric solutions.

Method: Developed a 140B token Polish text corpus for pre-training, created 77k custom instructions and 100k preference optimization datasets, implemented Responsible AI framework with strict data governance and hybrid safety filtering, and used advanced alignment techniques for base and instruction-tuned model variants.

Result: Successfully created the largest open-source Polish LLM family, demonstrated utility in public administration downstream tasks, and established a foundation for sovereign AI technologies in Poland.

Conclusion: PLLuM addresses the language gap in LLMs by providing specialized Polish models, fosters open research, and strengthens Poland’s AI sovereignty through transparent, culturally relevant language technology.

Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models’ architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.

[5] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models

Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik

Main category: cs.CL

TL;DR: STARS is a decoding-time algorithm that improves LLM alignment by iteratively sampling, scoring, and rejecting/accepting short token segments, achieving better performance than fine-tuning methods with higher efficiency.

Motivation: Existing alignment methods like fine-tuning are computationally expensive and suboptimal, while inference-time approaches like Best-of-N sampling require impractical computation for optimal alignment.

Method: Segment-level Token Alignment with Rejection Sampling (STARS) - steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments for early correction of generation path.
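The sample-score-reject loop above can be sketched deterministically. `CANDIDATES` stands in for sampling segments from the LLM and `reward` for a reward model; both are invented for illustration.

```python
from collections import deque

# Deterministic stand-in for the policy model: candidate segments in order.
CANDIDATES = deque([
    ["bad", "bad", "bad", "good"],    # low reward: rejected, resampled
    ["good", "good", "bad", "good"],  # accepted
    ["good", "bad", "good", "good"],  # accepted
])

def sample_segment() -> list[str]:
    return CANDIDATES.popleft()

def reward(segment: list[str]) -> float:
    # Stand-in reward: fraction of "good" tokens; a real reward model
    # scores alignment of the partial generation.
    return segment.count("good") / len(segment)

def stars_decode(n_segments: int, threshold: float = 0.5) -> list[str]:
    out: list[str] = []
    for _ in range(n_segments):
        seg = sample_segment()
        # Early correction: reject and resample low-reward segments before
        # the full sequence is committed (a real system caps the retries).
        while reward(seg) < threshold and CANDIDATES:
            seg = sample_segment()
        out.extend(seg)
    return out

text = stars_decode(n_segments=2)
print(reward(text))  # → 0.75
```

Because rejection happens per fixed-size segment rather than per full sequence (as in Best-of-N), bad continuations are discarded early instead of after generating the whole response.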

Result: STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates across six LLMs, while remaining competitive with Best-of-N baselines.

Conclusion: Granular, reward-guided sampling serves as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.

Abstract: Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.

[6] Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification

Mikołaj Langner, Jan Eliasz, Ewa Rudnicka, Jan Kocoń

Main category: cs.CL

TL;DR: A method for efficient multi-label text classification using LLMs by reformulating tasks as sequences of yes/no decisions, with prefix caching for efficiency gains and LLM-to-SLM distillation for model training.

Motivation: To address the inefficiency of traditional multi-label classification approaches with LLMs, particularly for short-text inference, while maintaining accuracy.

Method: Reformulate multi-label classification as independent dichotomic (yes/no) queries for each target dimension, implement prefix caching mechanism, and use LLM-to-SLM distillation where a large annotator model provides multiple annotations to fine-tune smaller models.
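The dichotomic reformulation with a shared, cached prefix can be sketched as follows. The "LLM" is a trivial stub and the cache is simulated with `lru_cache` to count how often the expensive prefix would be re-encoded; all names are illustrative.

```python
from functools import lru_cache

LABELS = ["joy", "anger", "sadness"]

@lru_cache(maxsize=None)
def encode_prefix(text: str) -> str:
    # Stand-in for KV-cache prefill of the shared prompt prefix;
    # the counter tracks how many times the prefix is actually encoded.
    encode_prefix.calls += 1
    return f"Text: {text}\n"
encode_prefix.calls = 0

def classify_yes_no(text: str, label: str) -> bool:
    prefix = encode_prefix(text)  # reused across labels via the cache
    prompt = prefix + f"Does the text express {label}? Answer yes or no."
    # Stub decision rule standing in for a real LLM answering `prompt`:
    return label in text

def classify(text: str) -> dict[str, bool]:
    # One independent yes/no query per target dimension.
    return {label: classify_yes_no(text, label) for label in LABELS}

result = classify("a text full of joy")
print(result, encode_prefix.calls)  # prefix encoded once, three labels queried
```

The efficiency gain comes from the asymmetry: for short texts the shared prefix dominates the prompt, so caching it once amortizes most of the cost across all 24 per-dimension queries.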

Result: Fine-tuned models show significant improvements over zero-shot baselines, especially on training dimensions, with substantial efficiency gains for short-text inference without accuracy loss.

Conclusion: Decomposing multi-label classification into dichotomic queries combined with distillation and cache-aware inference provides a scalable and effective framework for LLM-based classification, applicable across domains beyond affective states.

Abstract: We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single structured response, each target dimension is queried independently, which, combined with a prefix caching mechanism, yields substantial efficiency gains for short-text inference without loss of accuracy. To demonstrate the approach, we focus on affective text analysis, covering 24 dimensions including emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator model (DeepSeek-V3) provides multiple annotations per text, which are aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B). The fine-tuned models show significant improvements over zero-shot baselines, particularly on the dimensions seen during training. Our findings suggest that decomposing multi-label classification into dichotomic queries, combined with distillation and cache-aware inference, offers a scalable and effective framework for LLM-based classification. While we validate the method on affective states, the approach is general and applicable across domains.

[7] CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang, Zeying Xie, Yi R. Fung

Main category: cs.CL

TL;DR: CantoASR is a collaborative ASR-LALM framework that combines forced alignment, LoRA-finetuned Whisper, and instruction-tuned Qwen-Audio to improve Cantonese speech recognition by integrating acoustic tonal cues with large audio-language model reasoning.

Motivation: Low-resource Cantonese ASR faces challenges due to limited annotated data, six lexical tones, tone sandhi, and accent variation, leading to high word error rates in existing models like Whisper.

Method: Integrates forced alignment for acoustic feature extraction, LoRA-finetuned Whisper for improved tone discrimination, and instruction-tuned Qwen-Audio for prosody-aware correction in a collaborative ASR-LALM framework.

Result: Substantial Character Error Rate (CER) gains over Whisper-Large-V3 on spontaneous Cantonese data.

Conclusion: Integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.

Abstract: Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.

[8] Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens

Hellina Hailu Nigatu, Bethelhem Yemane Mamo, Bontu Fufa Balcha, Debora Taye Tesfaye, Elbethel Daniel Zewdie, Ikram Behiru Nesiru, Jitu Ewnetu Hailu, Senait Mengesha Yayo

Main category: cs.CL

TL;DR: Analysis of Machine Translation datasets for three low-resourced African languages reveals significant gender representation issues, including male gender skew, harmful stereotypes, and toxic content against women.

Motivation: To investigate the quality of Machine Translation datasets for low-resourced languages (Afan Oromo, Amharic, Tigrinya), focusing on gender representation issues and the risk of building poor-performing technologies that perpetuate societal biases.

Method: Analyzed MT datasets for three low-resourced languages, examining gender representation through names, grammatical gender of verbs, stereotypical depictions, and identification of harmful/toxic content.

Result: Found large skew towards male gender in all aspects, domain mismatch between training data (political/religious) and benchmarks (news/health/sports), and harmful depictions against women that were more prominent in languages with larger datasets.

Conclusion: Quantity of data does not guarantee quality; datasets for low-resourced languages contain significant gender biases and harmful content that need early mitigation to prevent perpetuating societal biases.

Abstract: As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages–Afan Oromo, Amharic, and Tigrinya, with a focus on the gender representation in the datasets. Our findings demonstrate that while training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender–in names of persons, the grammatical gender of verbs, and in stereotypical depictions in the datasets. Further, we found harmful and toxic depictions against women, which were more prominent for the language with the largest amount of data, underscoring that quantity does not guarantee quality. We hope that our work inspires further inquiry into the datasets collected for low-resourced languages and prompts early mitigation of harmful content. WARNING: This paper contains discussion of NSFW content that some may find disturbing.

[9] GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation

Manh Nguyen, Sunil Gupta, Dai Do, Hung Le

Main category: cs.CL

TL;DR: GRAD is a decoding-time method that mitigates LLM hallucinations by constructing token transition graphs from retrieved corpus evidence and adaptively fusing them with model logits during generation.

Motivation: Existing hallucination mitigation approaches rely on external knowledge sources through fragile prompting or costly symbolic knowledge integration, creating a need for lightweight, plug-and-play alternatives.

Method: Constructs sparse token transition graphs by accumulating next-token logits across retrieved corpus in single forward pass, then max-normalizes and adaptively fuses graph-retrieved logits with model logits during decoding.
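The fusion step described above can be sketched on a toy vocabulary. The evidence scores, blend weight, and example tokens are invented; in the paper the evidence is accumulated next-token logits over a retrieved corpus.

```python
import numpy as np

VOCAB = ["paris", "london", "banana"]

# Sparse transition evidence for the current context, e.g. accumulated
# next-token scores after "the capital of France is" across retrieved texts.
graph_evidence = np.array([9.0, 3.0, 0.0])
model_logits = np.array([2.0, 2.5, 2.4])  # model slightly prefers "london"

def grad_fuse(logits: np.ndarray, evidence: np.ndarray,
              alpha: float = 1.0) -> np.ndarray:
    # Max-normalize the graph-retrieved scores to [0, 1], then fuse with the
    # model's logits to favor high-evidence continuations.
    norm = evidence / evidence.max() if evidence.max() > 0 else evidence
    return logits + alpha * norm

fused = grad_fuse(model_logits, graph_evidence)
print(VOCAB[int(np.argmax(model_logits))],
      VOCAB[int(np.argmax(fused))])  # → london paris
```

Tokens without corpus evidence keep their original logits, which is how fluency is preserved: the fusion only nudges the distribution where the retrieved corpus actually has something to say.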

Result: Achieves up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness compared to greedy decoding across three models and multiple QA benchmarks.

Conclusion: GRAD provides effective hallucination mitigation using statistical evidence from corpus-level token transitions, offering lightweight alternative to contrastive decoding and knowledge graph augmentation.

Abstract: Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness compared to greedy decoding, while attaining the highest truth–informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.

[10] Context informs pragmatic interpretation in vision-language models

Alvin Wei Ming Tan, Ben Prystawski, Veronica Boyce, Michael C. Frank

Main category: cs.CL

TL;DR: Models perform poorly without context but improve significantly with relevant context in iterated reference games, though still lag behind humans.

Motivation: To test agents' ability for context-sensitive pragmatic reasoning in multi-turn linguistic environments through iterated reference games.

Method: Testing humans and vision-language models on iterated reference games with varying context (amount, order, relevance).

Result: Models performed above chance but worse than humans without relevant context; performance increased dramatically with relevant context over trials.

Conclusion: Few-shot reference games with abstract referents remain challenging for machine learning models despite improvements with context.

Abstract: Iterated reference games - in which players repeatedly pick out novel referents using language - present a test case for agents’ ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.

[11] BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, Amin Ahsan Ali

Main category: cs.CL

TL;DR: Multi-agent LLM pipelines improve SQL generation from natural language: iterative discussion boosts small-model performance by up to 10.6%, and a planner-coder pipeline reaches the best accuracy of 56.4%.

Motivation: Existing LLMs struggle with SQL generation due to large schemas and complex reasoning, while smaller efficient models are overlooked in favor of impractical complex pipelines.

Method: Three multi-agent pipelines: (1) iterative discussion with critique and synthesis, (2) planner-coder with stepwise generation plans, (3) coder-aggregator with independent generation and selection.
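Pipeline (3) above, coder-aggregator, can be sketched with stubbed agents. In the paper both the coders and the selecting reasoner are LLMs; here the coders return canned queries and the aggregator uses a trivial schema-grounding rule, all invented for illustration.

```python
SCHEMA = {"users": ["name", "age"]}

def coder_a(question: str) -> str:
    # Stub coder: a real one is an independent LLM generation.
    return "SELECT name FROM users WHERE age > 30"

def coder_b(question: str) -> str:
    return "SELECT * FROM customers"  # hallucinated table name

def aggregator(question: str, candidates: list[str]) -> str:
    # Stand-in for the reasoning agent: prefer candidates whose FROM table
    # actually exists in the schema.
    def grounded(sql: str) -> bool:
        table = sql.lower().split(" from ")[1].split()[0]
        return table in SCHEMA
    valid = [c for c in candidates if grounded(c)]
    return valid[0] if valid else candidates[0]

question = "Which users are older than 30?"
best = aggregator(question, [coder_b(question), coder_a(question)])
print(best)  # → SELECT name FROM users WHERE age > 30
```

The planner-coder and discussion pipelines differ only in the wiring: the planner emits a stepwise plan consumed by one coder, while discussion feeds each agent's query back to the others for critique across rounds.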

Result: Multi-agent discussion improved Qwen2.5-7b-Instruct by 10.6% in execution accuracy. The planner-coder pipeline performed best, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT from 52.4% to 56.4%, the highest score.

Conclusion: Multi-agent approaches, especially planner-coder pipelines, effectively enhance SQL generation performance for both small and large models, making efficient models more practical for text-to-SQL tasks.

Abstract: Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLM) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Codes are available at https://github.com/treeDweller98/bappa-sql.

[12] The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013–2023

Stefano M. Iacus, Devika Jain, Andrea Nasuto, Giuseppe Porro, Marcello Carammia, Andrea Vezzulli

Main category: cs.CL

TL;DR: The paper introduces the Human Flourishing Geographic Index (HFGI), a high-resolution measure of human flourishing derived from 2.6 billion geolocated tweets using fine-tuned LLMs to analyze 48 indicators across happiness, health, purpose, virtue, relationships, and financial stability.

Motivation: Existing measures of human flourishing lack the fine spatial and temporal resolution needed to understand societal well-being beyond economic indicators at detailed geographic and temporal scales.

Method: Analyzed approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard’s Global Flourishing Study framework, plus attitudes towards migration and perception of corruption.

Result: Created monthly and yearly county- and state-level indicators of flourishing-related discourse that are validated to accurately represent underlying constructs and show expected correlations with established indicators.

Conclusion: The HFGI enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, providing insights into human flourishing dynamics across the United States over the past decade through social media discourse.

Abstract: Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard’s Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.

[13] Direct Semantic Communication Between Large Language Models via Vector Translation

Fu-Chun Yang, Jason Eshraghian

Main category: cs.CL

TL;DR: The paper proposes using vector translations to enable direct semantic exchange between LLMs, allowing cross-model latent communication instead of token-based messaging.

DetailsMotivation: Current multi-agent LLM systems pass messages as plain tokens, discarding latent semantics and constraining information transfer while adding computational overhead.

Method: A dual-encoder translator trained between Llama-2-7B and Mistral-7B-Instruct learns mappings for direct semantic exchange between representation spaces, with translated vectors injected at 30% blending strength.

Result: The translator achieves 0.538 average cosine alignment, preserves computational stability, and shows a 2.01:1 transfer asymmetry favoring general-purpose over instruction-tuned models.

Conclusion: Cross-model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens while maintaining computational stability.

Abstract: In multi-agent settings, such as debate, reflection, or tool-calling, large language models (LLMs) pass messages as plain tokens, discarding most latent semantics. This constrains information transfer and adds unnecessary computational overhead. We form a latent bridge via vector translations, which use learned mappings that enable direct semantic exchange between representation spaces. A dual-encoder translator trained between Llama-2-7B and Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the translated vectors at 30 percent blending strength steers the target model’s generation without destabilizing logits. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned variants. This conservative injection preserves computational stability while demonstrating that cross-model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens.
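The mechanics of the latent bridge can be illustrated with a toy linear translator between two small random "activation" spaces. The paper trains a dual-encoder translator on real 7B-model hidden states, so the dimensions, least-squares fit, and data below are simplified stand-ins; only the cosine-alignment metric and the 30% blending rule are taken from the summary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for paired hidden states of two models (real dims are ~4096).
d, n = 32, 500
H_src = rng.normal(size=(n, d))                        # source-model activations
M = rng.normal(size=(d, d)) / np.sqrt(d)               # unknown "true" relation
H_tgt = H_src @ M + 0.1 * rng.normal(size=(n, d))      # paired target activations

# Fit a linear translator W by least squares (the paper uses a trained
# dual-encoder translator; a linear map is the simplest analogue).
W, *_ = np.linalg.lstsq(H_src, H_tgt, rcond=None)
translated = H_src @ W

# Average cosine alignment between translated and true target vectors.
cos = np.sum(translated * H_tgt, axis=1) / (
    np.linalg.norm(translated, axis=1) * np.linalg.norm(H_tgt, axis=1))

# Conservative injection at 30% blending strength, as in the paper.
alpha = 0.3
h_injected = (1 - alpha) * H_tgt + alpha * translated
```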

[14] Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Nicolò Pagan, Petter Törnberg, Christopher A. Bail, Anikó Hannák, Christopher Barrie

Main category: cs.CL

TL;DR: A computational Turing test framework validates LLM-generated text using aggregate metrics and linguistic features, revealing that even calibrated LLMs remain distinguishable from human text, with trade-offs between human-likeness and semantic fidelity.

DetailsMotivation: To address the lack of robust validation tools for assessing the realism of LLM-generated text in social science simulations, as current methods rely on unreliable human-judgment-based evaluations.

Method: Introduced a computational Turing test framework combining BERT-based detectability, semantic similarity, stylistic markers, and topical patterns; systematically compared 9 LLMs across 5 calibration strategies on X, Bluesky, and Reddit data.

Result: LLM outputs remain clearly distinguishable from human text even after calibration, especially in affective tone; instruction-tuned models underperform base models; scaling model size doesn’t improve human-likeness; trade-off exists between human-likeness and semantic fidelity.

Conclusion: Provides a scalable validation framework for LLM simulations but cautions about current limitations in capturing human communication, highlighting the trade-off between human-likeness and semantic accuracy.

Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations – testing whether humans can distinguish AI from human output – despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies – including fine-tuning, stylistic prompting, and context retrieval – benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations – and offer a cautionary note about their current limitations in capturing human communication.

[15] Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises

Shiyin Lin

Main category: cs.CL

TL;DR: Proposes integrating abductive inference into RAG systems to handle incomplete evidence by generating and validating missing premises, improving accuracy and faithfulness.

DetailsMotivation: RAG systems often fail when retrieved evidence is incomplete, leaving gaps in reasoning that abductive inference can help bridge.

Method: Detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks.

Result: Experimental results show improved answer accuracy and reasoning faithfulness on abductive reasoning and multi-hop QA benchmarks.

Conclusion: Abductive inference is a promising direction for enhancing robustness and explainability of RAG systems.

Abstract: Large Language Models (LLMs) enhanced with retrieval – commonly referred to as Retrieval-Augmented Generation (RAG) – have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, \emph{abductive inference} – the process of generating plausible missing premises to explain observations – offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.
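The three-stage pipeline (detect insufficient evidence, generate candidate premises, validate them) can be sketched as below. Every function body is a hypothetical stand-in: the actual framework uses LLMs for premise generation and for the consistency and plausibility checks.

```python
from dataclasses import dataclass

@dataclass
class Premise:
    text: str
    plausibility: float  # in the real pipeline, scored by an LLM check

def evidence_sufficient(question, evidence, min_passages=1):
    # Stand-in: a real system would ask an LLM or entailment model whether
    # the retrieved passages suffice to answer the question.
    return len(evidence) >= min_passages

def generate_candidate_premises(question, evidence):
    # Stand-in for LLM abduction: propose premises that would bridge the gap.
    return [Premise("Paris is the capital of France.", 0.95),
            Premise("Paris is the capital of Germany.", 0.05)]

def validate(premise, evidence, min_plausibility=0.5):
    # Consistency check (crude proxy) plus a plausibility threshold.
    consistent = premise.text not in evidence
    return consistent and premise.plausibility >= min_plausibility

def abductive_rag(question, evidence):
    if evidence_sufficient(question, evidence):
        return list(evidence)
    accepted = [p.text for p in generate_candidate_premises(question, evidence)
                if validate(p, evidence)]
    return list(evidence) + accepted

print(abductive_rag("Which country's capital hosts the Louvre?", []))
```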

[16] WST: Weakly Supervised Transducer for Automatic Speech Recognition

Dongji Gao, Chenda Liao, Changliang Liu, Matthew Wiesner, Leibny Paola Garcia, Daniel Povey, Sanjeev Khudanpur, Jian Wu

Main category: cs.CL

TL;DR: Proposes Weakly Supervised Transducer (WST) for ASR that handles noisy transcripts without needing confidence scores or pre-trained models, maintaining performance with up to 70% transcription errors and outperforming CTC-based methods.

DetailsMotivation: RNN-T models require large-scale high-quality annotated data which is costly and difficult to obtain, creating a need for methods that can work with imperfect transcripts.

Method: WST integrates a flexible training graph designed to robustly handle errors in transcripts without requiring additional confidence estimation or auxiliary pre-trained models.

Result: WST maintains performance with transcription error rates up to 70% and consistently outperforms existing CTC-based weakly supervised approaches like BTC and OTC on synthetic and industrial datasets.

Conclusion: WST demonstrates practical utility and robustness in realistic ASR settings, offering an effective solution for training with imperfect transcripts.

Abstract: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.

[17] T-FIX: Text-Based Explanations with Features Interpretable to eXperts

Shreya Havaldar, Helen Jin, Chaehyeon Kim, Anton Xue, Weiqiu You, Marco Gatti, Bhuvnesh Jain, Helen Qu, Daniel A Hashimoto, Amin Madani, Rajat Deo, Sameed Ahmed M. Khatana, Gary E. Weissman, Lyle Ungar, Eric Wong

Main category: cs.CL

TL;DR: T-FIX is a benchmark for evaluating LLM explanations’ alignment with expert intuition across seven knowledge-intensive domains, addressing limitations of current evaluation schemes.

DetailsMotivation: Current LLM explanation evaluations focus on plausibility or internal faithfulness but fail to assess whether explanations truly align with expert reasoning in knowledge-intensive domains like medicine, astronomy, and therapy.

Method: Developed T-FIX benchmark spanning seven knowledge-intensive domains in collaboration with domain experts, creating novel metrics to measure LLM explanation alignment with expert judgment.

Result: The paper introduces a formal criterion for expert alignment and provides a comprehensive benchmark with specialized metrics for evaluating LLM explanations in expert domains.

Conclusion: T-FIX addresses a critical gap in LLM explanation evaluation by focusing on expert alignment rather than just plausibility, enabling better assessment of LLM performance in knowledge-intensive professional settings.

Abstract: As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.

[18] Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

Xinying Qian, Ying Zhang, Yu Zhao, Baohang Zhou, Xuhui Sui, Xiaojie Yuan

Main category: cs.CL

TL;DR: PoK framework enhances LLMs’ temporal reasoning for TKGQA by decomposing questions into sub-objectives and using contrastive temporal retrieval from TKGs.

DetailsMotivation: Existing TKGQA methods fail to fully understand complex temporal constraints, while LLMs have strong semantic understanding but limited temporal reasoning abilities and suffer from hallucination.

Method: Proposes Plan of Knowledge (PoK) framework with contrastive temporal retriever - decomposes questions into sub-objectives and retrieves temporally aligned facts from Temporal Knowledge Store.

Result: Significantly improves retrieval precision and reasoning accuracy on four TKGQA datasets, surpassing state-of-the-art methods by up to 56.0%.

Conclusion: PoK effectively enhances interpretability and factual consistency of temporal reasoning by combining structured planning with temporal knowledge retrieval.

Abstract: Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited. LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge framework with a contrastive temporal retriever, which is named PoK. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of the state-of-the-art TKGQA methods by 56.0% at most.

[19] The truth is no diaper: Human and AI-generated associations to emotional words

Špela Vintar, Jan Jona Javoršek

Main category: cs.CL

TL;DR: Comparison of word associations between humans and LLMs shows moderate overlap, with LLM associations being more emotionally amplified, predictable, and less creative than human associations.

DetailsMotivation: To understand if LLMs generate word associations similarly to humans, particularly for emotionally loaded words, and to explore the creative aspects of associative thinking.

Method: Comparative analysis of associative behavior by examining responses to emotionally loaded word cues from both human participants and large language models.

Result: LLM-human association overlap is moderate; LLMs amplify emotional load of stimuli and produce more predictable, less creative associations compared to humans.

Conclusion: LLMs demonstrate different associative patterns than humans, with reduced creativity and heightened emotional amplification in their word associations.

Abstract: Human word associations are a well-known method of gaining insight into the internal mental lexicon, but the responses spontaneously offered by human participants to word cues are not always predictable as they may be influenced by personal experience, emotions or individual cognitive styles. The ability to form associative links between seemingly unrelated concepts can be the driving mechanisms of creativity. We perform a comparison of the associative behaviour of humans compared to large language models. More specifically, we explore associations to emotionally loaded words and try to determine whether large language models generate associations in a similar way to humans. We find that the overlap between humans and LLMs is moderate, but also that the associations of LLMs tend to amplify the underlying emotional load of the stimulus, and that they tend to be more predictable and less creative than human ones.

[20] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

Eva Prakash, Maayane Attias, Pierre Chambon, Justin Xu, Steven Truong, Jean-Benoit Delbrouck, Tessa Cook, Curtis Langlotz

Main category: cs.CL

TL;DR: Transformer-based model for de-identifying radiology reports achieves state-of-the-art PHI detection performance, outperforming commercial systems and demonstrating robust cross-institutional generalization.

DetailsMotivation: To enhance automated de-identification of radiology reports by scaling transformer models with large training datasets and benchmarking against commercial cloud vendor systems for PHI detection.

Method: Fine-tuned transformer-based PHI de-identification pipeline on two large annotated radiology corpora from Stanford, introduced additional AGE category, evaluated on Stanford and Penn test sets, assessed synthetic PHI generation stability, and compared against commercial systems.

Result: Achieved overall F1 scores of 0.973 (Penn) and 0.996 (Stanford), outperforming prior models. Synthetic PHI evaluation showed consistent detectability (F1: 0.959). Model outperformed all vendor systems (F1: 0.960 vs. 0.632-0.754).

Conclusion: Transformer-based de-identification model trained on diverse radiology datasets establishes new benchmark for secure clinical text processing, outperforming academic and commercial systems while preserving data utility.

Abstract: Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a “hide-in-plain-sight” method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
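Token-level precision, recall, and F1 for PHI detection, as reported above, can be computed as in this sketch; the binary PHI/O tagging is a simplification, since the paper scores multiple PHI categories (including the added AGE category):

```python
def token_prf(gold_labels, pred_labels, phi_label="PHI"):
    """Token-level precision/recall/F1 for PHI detection.
    gold_labels / pred_labels: parallel lists of per-token tags."""
    pairs = list(zip(gold_labels, pred_labels))
    tp = sum(g == phi_label and p == phi_label for g, p in pairs)
    fp = sum(g != phi_label and p == phi_label for g, p in pairs)
    fn = sum(g == phi_label and p != phi_label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["O", "PHI", "PHI", "O", "PHI"]
pred = ["O", "PHI", "O",   "O", "PHI"]
print(token_prf(gold, pred))  # one PHI token missed: recall and F1 drop
```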

[21] A Characterization of List Language Identification in the Limit

Moses Charikar, Chirag Pabbaraju, Ambuj Tewari

Main category: cs.CL

TL;DR: The paper studies k-list language identification in the limit, giving an exact characterization of when collections of languages can be identified with k guesses per step, and shows connections to statistical learning rates.

DetailsMotivation: Classical language identification in the limit is impossible for most interesting language collections. Recent positive results for language generation motivate revisiting identification with the additional power of producing multiple guesses.

Method: The authors develop a recursive characterization based on Angluin’s original work, showing that k-list identification is equivalent to decomposing the language collection into k subcollections that are each identifiable with single guesses.

Result: An exact characterization of collections that can be k-list identified in the limit, and establishment of exponential identification rates in the statistical setting when the collection is k-list identifiable.

Conclusion: K-list identification is possible exactly when the language collection decomposes into k identifiable subcollections, and when possible, exponential identification rates are achievable and optimal.

Abstract: We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin’s characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.
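The main characterization can be restated compactly; the tell-tale condition in the second line is Angluin's classical criterion for identification with a list of size 1:

```latex
% A collection is k-list identifiable in the limit iff it splits into
% k subcollections, each identifiable in the limit with a single guess:
\mathcal{L} \text{ is } k\text{-list identifiable in the limit}
\iff
\mathcal{L} = \mathcal{L}_1 \cup \cdots \cup \mathcal{L}_k,
\text{ each } \mathcal{L}_i \text{ identifiable in the limit.}

% Angluin's criterion for a single subcollection:
\mathcal{L}_i \text{ identifiable in the limit}
\iff
\forall L \in \mathcal{L}_i \;\, \exists \text{ finite tell-tale } T_L \subseteq L :
\nexists L' \in \mathcal{L}_i \text{ with } T_L \subseteq L' \subsetneq L.
```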

[22] Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models

Wenmo Qiu, Saurabh Srivastava

Main category: cs.CL

TL;DR: Batch prompting in LLMs improves reasoning accuracy while reducing token usage 3x-5x by regularizing model behavior, suppressing overthinking, and enabling emergent collective reasoning patterns.

DetailsMotivation: To explore batch prompting's underappreciated benefits beyond just inference cost amortization, specifically its regularization effects on multi-step reasoning in Large Reasoning Models.

Method: Comprehensive study across 13 diverse benchmarks analyzing batched inference, behavioral analysis of model outputs, and examination of collective reasoning patterns.

Result: Batching improves accuracy while reducing reasoning token usage by 3x-5x, suppresses overthinking, reduces hedging language, encourages decisive answers, and enables emergent collective reasoning where models generalize patterns from easier to harder examples.

Conclusion: Batching serves as a powerful inference-time regularizer for more efficient and reliable LLM reasoning, not just a throughput optimization.

Abstract: Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.
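Batch prompting itself is mechanically simple: pack several questions into one prompt and split the numbered completion afterwards. A minimal sketch, where the prompt wording and answer format are illustrative rather than the paper's:

```python
import re

def build_batched_prompt(questions):
    """Pack several questions into a single numbered prompt."""
    lines = ["Answer each question concisely, in the form 'A<i>: <answer>'."]
    for i, q in enumerate(questions, 1):
        lines.append(f"Q{i}: {q}")
    return "\n".join(lines)

def parse_batched_answers(completion, n):
    """Split a numbered completion back into per-question answers."""
    answers = {}
    for i in range(1, n + 1):
        match = re.search(rf"A{i}:\s*(.+)", completion)
        if match:
            answers[i] = match.group(1).strip()
    return answers

prompt = build_batched_prompt(["What is 2+3?", "Capital of Japan?"])
reply = "A1: 5\nA2: Tokyo"  # stand-in for a model completion
print(parse_batched_answers(reply, 2))
```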

[23] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao, Jipeng Zhang, Yunshi Lan

Main category: cs.CL

TL;DR: RIDE is an adversarial question-rewriting framework that uses Item Response Theory to create more challenging mathematical problems, revealing LLMs’ limited robustness in mathematical reasoning.

DetailsMotivation: Current LLM evaluations for mathematical reasoning may be inflated by data leakage or pattern matching rather than genuine reasoning, requiring adversarial perturbation methods to measure true reasoning ability.

Method: Leverages Item Response Theory to measure question difficulty, uses 35 LLMs as simulated students to build a difficulty ranker, and employs reinforcement learning to guide question rewriting across difficulty levels.

Result: RIDE-generated perturbed versions degrade advanced LLM performance by an average 21.73% across 26 models, exposing limited robustness in mathematical reasoning.

Conclusion: The framework successfully creates intrinsically more challenging mathematical problems and validates the effectiveness of adversarial evaluation in measuring true mathematical reasoning capabilities of LLMs.

Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
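RIDE's use of Item Response Theory can be illustrated with the standard two-parameter logistic (2PL) model, under which an item's difficulty b is the ability level at which a student answers correctly with probability 1/2. The code below is a generic 2PL sketch, not the paper's fitting procedure:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a student of ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# 35 simulated "students" (here: evenly spread abilities). In RIDE, 35 LLMs
# play this role, and their response patterns are used to rank difficulty.
thetas = [-2 + 4 * i / 34 for i in range(35)]
easy_item = sum(p_correct(t, a=1.0, b=-1.0) for t in thetas) / 35
hard_item = sum(p_correct(t, a=1.0, b=1.5) for t in thetas) / 35
print(round(easy_item, 3), round(hard_item, 3))
```

A higher difficulty parameter b yields a lower expected accuracy over the same population, which is exactly the signal a difficulty ranker needs.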

[24] Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains

Mohammed Musthafa Rafi, Adarsh Krishnamurthy, Aditya Balu

Main category: cs.CL

TL;DR: LAAC positions LLMs as communication intermediaries to enable authentic knowledge exchange, addressing trustworthiness issues in information fidelity, reproducibility, and query response integrity across various communication domains.

DetailsMotivation: To address the problem of AI-generated content creating communication theater where neither senders nor recipients engage with authentic content, by shifting from cycles of AI-generated inflation and compression to genuine knowledge exchange.

Method: Systematically evaluates trustworthiness requirements through controlled experiments using LAAC’s multi-agent architecture, investigating three dimensions: information capture fidelity, reproducibility, and query response integrity across multiple communication use cases.

Result: Preliminary findings reveal measurable trust gaps in LAAC’s deployment, particularly in high-stakes communication scenarios, indicating that current implementations have limitations in information fidelity, consistency, and reliability.

Conclusion: While LAAC offers a promising paradigm shift for authentic communication, significant trust gaps must be addressed before reliable deployment in critical communication scenarios can be achieved.

Abstract: The proliferation of AI-generated content has created an absurd communication theater where senders use LLMs to inflate simple ideas into verbose content, recipients use LLMs to compress them back into summaries, and as a consequence neither party engages with authentic content. LAAC (LLM as a Communicator) proposes a paradigm shift - positioning LLMs as intelligent communication intermediaries that capture the sender’s intent through structured dialogue and facilitate genuine knowledge exchange with recipients. Rather than perpetuating cycles of AI-generated inflation and compression, LAAC enables authentic communication across diverse contexts including academic papers, proposals, professional emails, and cross-platform content generation. However, deploying LLMs as trusted communication intermediaries raises critical questions about information fidelity, consistency, and reliability. This position paper systematically evaluates the trustworthiness requirements for LAAC’s deployment across multiple communication domains. We investigate three fundamental dimensions: (1) Information Capture Fidelity - accuracy of intent extraction during sender interviews across different communication types, (2) Reproducibility - consistency of structured knowledge across multiple interaction instances, and (3) Query Response Integrity - reliability of recipient-facing responses without hallucination, source conflation, or fabrication. Through controlled experiments spanning multiple LAAC use cases, we assess these trust dimensions using LAAC’s multi-agent architecture. Preliminary findings reveal measurable trust gaps that must be addressed before LAAC can be reliably deployed in high-stakes communication scenarios.

[25] LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

Michał Karp, Anna Kubaszewska, Magdalena Król, Robert Król, Aleksander Smywiński-Pohl, Mateusz Szymański, Witold Wydmański

Main category: cs.CL

TL;DR: Current LLMs cannot pass Poland’s National Appeal Chamber exam despite good performance on knowledge tests, failing the practical written judgment component and showing limitations in legal reasoning.

DetailsMotivation: To empirically assess whether current LLMs can pass official legal qualifying exams and evaluate their potential to replace human judges in specialized legal domains like public procurement.

Method: Tested multiple LLMs (GPT-4.1, Claude 4 Sonnet, Bielik-11B-v2.6) in closed-book and Retrieval-Augmented Generation settings on exam components including multiple-choice knowledge test and written judgment, using hybrid information recovery pipeline and LLM-as-judge evaluation approach.

Result: Models achieved satisfactory scores in knowledge test but none met passing threshold in practical written part; LLM-as-judge evaluations often diverged from official committee judgments.

Conclusion: Despite rapid progress, current LLMs cannot yet replace human judges or examiners in Polish public procurement adjudication due to hallucinations, incorrect legal citations, weak logical argumentation, and need for expert collaboration.

Abstract: This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland’s National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the ‘LLM-as-a-judge’ approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the ‘LLM-as-a-judge’ often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.

[26] REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs

Liran Cohen, Yaniv Nemcovesky, Avi Mendelson

Main category: cs.CL

TL;DR: REMIND is a novel evaluation method for machine unlearning that detects residual memorization by analyzing loss patterns over small input variations, providing more sensitive assessment than single-point evaluations.

DetailsMotivation: Existing unlearning evaluation methods focus on individual inputs and may miss residual influence in semantically similar examples, potentially compromising privacy and leading to information leakage.

Method: REMIND analyzes the model’s loss over small input variations to reveal patterns in the loss landscape, where unlearned data show flatter patterns while retained data show sharper, more volatile patterns.

Result: REMIND outperforms existing methods, demonstrates robustness across different models/datasets/paraphrased inputs, and requires only query-based access, making it practical for real-world deployment.

Conclusion: REMIND provides a more sensitive and interpretable measure of unlearning effectiveness, offering a reliable framework to assess unlearning in language models and a novel perspective on memorization and unlearning.

Abstract: Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model’s loss over small input variations and reveals patterns unnoticed by single-point evaluations. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and demonstrates robustness across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By providing a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework to assess unlearning in language models. As a result, REMIND offers a novel perspective on memorization and unlearning.
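The neighborhood-dynamics idea above can be sketched in a few lines: probe the model's loss at small variations of an input and score how volatile the local loss landscape is. This is a minimal illustration, not the paper's exact procedure; the perturbation scheme (single-word deletion), the volatility measure (standard deviation), and the toy loss are all assumptions made here.

```python
import random

def remind_score(loss_fn, text, n_variants=8, seed=0):
    """Probe the loss landscape around `text` by deleting one random word
    per variant and measuring how much the loss fluctuates.
    A low score (flat landscape) suggests the example was unlearned;
    a high score (sharp, volatile landscape) suggests residual memorization."""
    rng = random.Random(seed)
    words = text.split()
    losses = [loss_fn(text)]
    for _ in range(n_variants):
        i = rng.randrange(len(words))
        variant = " ".join(words[:i] + words[i + 1:])
        losses.append(loss_fn(variant))
    mean = sum(losses) / len(losses)
    return (sum((l - mean) ** 2 for l in losses) / len(losses)) ** 0.5

# Toy loss: a "memorized" string gets a sharply lower loss than any variant.
memorized = "the secret launch code is 1234"
def toy_loss(s):
    return 0.1 if s == memorized else 2.0

sharp = remind_score(toy_loss, memorized)          # volatile neighborhood
flat = remind_score(toy_loss, "an ordinary line")  # uniformly high loss
print(sharp > flat)  # True
```

Because only `loss_fn` queries are needed, this matches the abstract's claim that REMIND requires query-based access alone.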

[27] Reusing Pre-Training Data at Test Time is a Compute Multiplier

Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, Tom Gunter

Main category: cs.CL

TL;DR: Pre-training leaves significant dataset value unused; retrieval at test time provides 5x compute multiplier gains on MMLU and up to 10 percentage point improvements.

DetailsMotivation: To quantify how much dataset value is left behind by pre-training and how this changes across model scale, since current methods may not fully utilize pre-training data.

Method: Use retrieval augmented generation with test-time compute to measure unused dataset value, applying this to standard open-sourced datasets and evaluating on MMLU, Math-500, and SimpleQA.

Result: Significant accuracy gains persist through decontamination; retrieval acts as a 5x compute multiplier on MMLU; LLaMA 3.1 8B shows a 10 percentage point improvement on MMLU with additional test-time compute.

Conclusion: Today’s pre-training methods do not fully utilize information in existing datasets, leaving substantial room for improvement in data efficiency.

Abstract: Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today’s pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
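The setup the abstract describes, retrieving pre-training text at test time and prepending it to the prompt, can be sketched with a toy word-overlap retriever. The paper's actual retrieval system is not specified here, so `retrieve` and `augment_prompt` are purely illustrative stand-ins.

```python
def retrieve(query, corpus, k=2):
    """Rank pre-training documents by word overlap with the query (a toy
    stand-in for a real retrieval system) and return the top-k documents."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def augment_prompt(question, corpus):
    """Spend test-time compute on retrieval: prepend retrieved pre-training
    text as context before the model answers."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "The mitochondrion is the powerhouse of the cell.",
    "Paris is the capital of France.",
    "The Treaty of Westphalia was signed in 1648.",
]
print(augment_prompt("What is the capital of France?", corpus))
```

The compute-multiplier claim is then an empirical comparison: a model with this augmentation matches the accuracy of a model pre-trained with roughly 5x the compute.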

[28] Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models

Salma Mekaoui, Hiba Sofyan, Imane Amaaz, Imane Benchrif, Arsalane Zarghili, Ilham Chaker, Nikola S. Nikolov

Main category: cs.CL

TL;DR: A graph-based approach for topic labeling that enriches topic words with semantically related terms and analyzes their relationships to derive meaningful labels, achieving results comparable to ChatGPT-3.5 while being computationally efficient.

DetailsMotivation: Topic modeling produces topics as word distributions that lack clear interpretability, and existing computational methods are resource-intensive. The goal is to assign meaningful labels to topic word sets without relying on expensive models.

Method: Proposes a graph-based approach that enriches topic words with semantically related terms and explores relationships among them through graph analysis to derive suitable topic labels.

Result: The method consistently outperformed traditional benchmarks in BERTScore and cosine similarity, and produced results comparable to ChatGPT-3.5 while remaining computationally efficient across two datasets.

Conclusion: The graph-based approach provides an effective alternative for topic labeling that balances performance and computational efficiency, with future directions focusing on enhancing interpretability and automation.

Abstract: Extracting topics from text has become an essential task, especially with the rapid growth of unstructured textual data. Most existing works rely on highly computational methods to address this challenge. In this paper, we argue that probabilistic and statistical approaches, such as topic modeling (TM), can offer effective alternatives that require fewer computational resources. TM is a statistical method that automatically discovers topics in large collections of unlabeled text; however, it produces topics as distributions of representative words, which often lack clear interpretability. Our objective is to perform topic labeling by assigning meaningful labels to these sets of words. To achieve this without relying on computationally expensive models, we propose a graph-based approach that not only enriches topic words with semantically related terms but also explores the relationships among them. By analyzing these connections within the graph, we derive suitable labels that accurately capture each topic’s meaning. We present a comparative study between our proposed method and several benchmarks, including ChatGPT-3.5, across two different datasets. Our method achieved consistently better results than traditional benchmarks in terms of BERTScore and cosine similarity and produced results comparable to ChatGPT-3.5, while remaining computationally efficient. Finally, we discuss future directions for topic labeling and highlight potential research avenues for enhancing interpretability and automation.
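The graph-based labeling step can be illustrated with a minimal sketch: link topic words to semantically related terms, then pick the most central node as the label. The enrichment source and the centrality measure (plain degree) are assumptions here; the paper's graph analysis may differ.

```python
from collections import defaultdict

def label_topic(topic_words, related_terms):
    """Build an undirected graph linking each topic word to its semantically
    related terms (`related_terms` would come from an embedding model or a
    lexical resource; hand-written below), then pick the most central node
    as the topic label."""
    graph = defaultdict(set)
    for word in topic_words:
        for term in related_terms.get(word, []):
            graph[word].add(term)
            graph[term].add(word)
    # Degree centrality: the node connected to the most topic concepts.
    return max(graph, key=lambda n: len(graph[n]))

topic = ["goal", "referee", "striker", "penalty"]
related = {
    "goal": ["football"],
    "referee": ["football", "match"],
    "striker": ["football"],
    "penalty": ["football", "match"],
}
print(label_topic(topic, related))  # "football" links to all four words
```

The appeal of this style of method is visible even in the toy case: no large model is queried at labeling time, only a small graph is analyzed.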

[29] SSPO: Subsentence-level Policy Optimization

Kun Yang, Zikang Chen, Yanmeng Wang, Zhigen Li

Main category: cs.CL

TL;DR: SSPO introduces sentence-level importance ratio in RLVR to balance between token-level GRPO (unstable) and response-level GSPO (low data usage), achieving better performance with stable training and higher data utilization.

DetailsMotivation: Existing RLVR algorithms like GRPO suffer from unstable policy updates due to token-level importance ratios, while GSPO has low data utilization due to response-level importance ratios that can discard entire responses. There's a need for a balanced approach.

Method: SSPO applies sentence-level importance ratio to balance GRPO and GSPO. It also uses sentence entropy to adjust PPO-CLIP clipping bounds, encouraging exploration for high-entropy tokens while narrowing bounds for low-entropy tokens.

Result: SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and reaches state-of-the-art performance on three datasets.

Conclusion: SSPO effectively leverages generated data by taking the advantages of GSPO while avoiding its shortcomings, demonstrating superior performance in RLVR for LLM reasoning.

Abstract: As a significant part of post-training of Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs’ reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses on optimizing single tokens. This makes it easily affected by outliers, leading to model training collapse. GSPO proposed calculating the importance ratio at the response level, which solves GRPO’s problems of high variance and accumulated training noise. However, since all response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, causing the entire response to be mistakenly discarded and thus reducing the utilization of sampled data. This paper introduces SSPO, which applies a sentence-level importance ratio, striking a balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents whole responses from being abandoned by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore while narrowing the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and achieves state-of-the-art performance on three datasets. These results highlight SSPO’s effectiveness in leveraging generated data by retaining the strengths of GSPO while rejecting its shortcomings.
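The sentence-level ratio can be sketched as a geometric mean of per-token probability ratios within each sentence, one ratio per sentence rather than per token (GRPO) or per response (GSPO). The entropy-adjusted clipping rule below is a hypothetical linear form chosen for illustration; the paper's exact formula is not reproduced here.

```python
import math

def sspo_ratios(logp_new, logp_old, sent_ids):
    """Sentence-level importance ratios: geometric mean of per-token
    probability ratios within each sentence (one shared ratio per sentence),
    a middle ground between token-level and response-level ratios."""
    groups = {}
    for lp_n, lp_o, s in zip(logp_new, logp_old, sent_ids):
        groups.setdefault(s, []).append(lp_n - lp_o)
    return {s: math.exp(sum(d) / len(d)) for s, d in groups.items()}

def adaptive_clip(ratio, advantage, entropy, eps=0.2, scale=0.1):
    """PPO-CLIP surrogate with an entropy-adjusted bound: high-entropy
    sentences get a wider clipping range (more exploration), low-entropy
    ones a narrower range. The linear adjustment is our assumption."""
    e = max(0.05, eps + scale * (entropy - 1.0))
    clipped = max(1.0 - e, min(1.0 + e, ratio))
    return min(ratio * advantage, clipped * advantage)

# Two sentences: the outlier token in sentence 0 is averaged away
# instead of dominating a token-level update or sinking the whole response.
ratios = sspo_ratios(
    logp_new=[-1.0, -5.0, -1.0, -1.1],
    logp_old=[-1.0, -1.0, -1.0, -1.0],
    sent_ids=[0, 0, 1, 1],
)
print(ratios)
```

Averaging log-ratios within a sentence damps single-token outliers, while clipping at most one sentence (not the whole response) keeps the rest of the sample usable.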

[30] Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Mohammad Amin Ghanizadeh, Mohammad Javad Dousti

Main category: cs.CL

TL;DR: A data selection method for fine-tuning machine translation that uses learnability scores and batch selection to improve data efficiency and translation performance.

DetailsMotivation: Data quality and effective selection are fundamental for improving machine translation model performance and achieving robust, reliable systems.

Method: Leverages synergy between learner and pre-trained reference models, defines learnability scores to evaluate data utility, and employs batch selection considering interdependencies among data points.

Result: Achieves up to 5x improvement in data efficiency compared to iid baseline, 24x computational efficiency with cached embeddings, and superior translation performance over random selection.

Conclusion: The proposed data selection methodology effectively enhances training efficiency and generalization for machine translation systems.

Abstract: Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by 24x when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to a random selection method.
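A common way to instantiate a learner/reference learnability score with interdependency-aware batch selection is a greedy procedure like the sketch below. The score definition (loss difference) and the redundancy penalty are illustrative assumptions, not the paper's exact formulation.

```python
def learnability(learner_loss, reference_loss):
    """A point is worth training on if the learner still finds it hard
    while the pre-trained reference finds it easy, i.e. it is learnable
    rather than noise."""
    return learner_loss - reference_loss

def select_batch(points, batch_size, sim):
    """Greedy joint selection: repeatedly take the candidate with the best
    learnability score minus a redundancy penalty against points already
    chosen, so the batch stays both useful and diverse."""
    chosen = []
    pool = list(points)
    while pool and len(chosen) < batch_size:
        best = max(
            pool,
            key=lambda p: learnability(p["learner_loss"], p["ref_loss"])
            - max((sim(p, c) for c in chosen), default=0.0),
        )
        chosen.append(best)
        pool.remove(best)
    return chosen

points = [
    {"id": "a", "learner_loss": 3.0, "ref_loss": 1.0, "topic": "news"},
    {"id": "b", "learner_loss": 2.9, "ref_loss": 1.0, "topic": "news"},
    {"id": "c", "learner_loss": 2.0, "ref_loss": 1.0, "topic": "law"},
]
sim = lambda p, c: 1.5 if p["topic"] == c["topic"] else 0.0
batch = select_batch(points, 2, sim)
print([p["id"] for p in batch])  # ['a', 'c']: 'b' is redundant with 'a'
```

Caching embeddings (here abstracted into `sim`) is what makes repeated similarity lookups cheap, which is the source of the reported computational savings.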

[31] If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs

Lars Bungum, Charles Yijia Huang, Abeer Kashar

Main category: cs.CL

TL;DR: LLMs were tested on temporal reasoning using 1940 Norwegian trivia questions, answering as if in 1940. English prompts outperformed Norwegian, and larger models performed better across tested model families.

DetailsMotivation: To evaluate LLMs' temporal reasoning capabilities by testing their ability to answer historical trivia questions from a 1940 Norwegian book while adopting the temporal context of that era.

Method: Used a 1940 Norwegian trivia book, prompted LLMs to answer questions as if in 1940, tested in both English and Norwegian, employed LLM-as-judge for grading with human validation.

Result: English prompting consistently yielded better results than Norwegian. Larger LLMs showed improved performance across DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families.

Conclusion: LLMs demonstrate temporal reasoning capabilities but perform better with English prompts than Norwegian, even when testing specialized Norwegian LLMs. Model size positively correlates with performance.

Abstract: In this study, we experiment with the ability of LLMs to do temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We also pose the questions in both English and Norwegian. Correct answers are often presented as sentences, and grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than in Norwegian, an unexpected result. In contrast, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, and also the largest available LLM especially crafted for Norwegian.

[32] Probabilistic Textual Time Series Depression Detection

Fabian Schmidt, Seyedehmoniba Ravan, Vladimir Vlassov

Main category: cs.CL

TL;DR: PTTSD is a probabilistic framework for depression detection from clinical interviews that predicts PHQ-8 scores with uncertainty estimates over time, achieving state-of-the-art performance among text-only systems.

DetailsMotivation: Existing depression severity prediction models lack uncertainty estimates and temporal modeling, which are essential for clinical decision support and interpretable predictions.

Method: Proposed PTTSD framework with sequence-to-sequence and sequence-to-one variants using bidirectional LSTMs, self-attention, residual connections, and Gaussian/Student-t output heads trained via negative log-likelihood.

Result: Achieved state-of-the-art performance on E-DAIC and DAIC-WOZ datasets (MAE = 3.85 on E-DAIC, 3.55 on DAIC) with well-calibrated prediction intervals. Ablations confirmed value of attention and probabilistic modeling.

Conclusion: PTTSD provides interpretable and clinically relevant uncertainty-aware forecasting for depression detection, with demonstrated generality and well-calibrated uncertainty estimates.

Abstract: Accurate and interpretable predictions of depression severity are essential for clinical decision support, yet existing models often lack uncertainty estimates and temporal modeling. We propose PTTSD, a Probabilistic Textual Time Series Depression Detection framework that predicts PHQ-8 scores from utterance-level clinical interviews while modeling uncertainty over time. PTTSD includes sequence-to-sequence and sequence-to-one variants, both combining bidirectional LSTMs, self-attention, and residual connections with Gaussian or Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC and DAIC-WOZ, PTTSD achieves state-of-the-art performance among text-only systems (e.g., MAE = 3.85 on E-DAIC, 3.55 on DAIC) and produces well-calibrated prediction intervals. Ablations confirm the value of attention and probabilistic modeling, while comparisons with MentalBERT establish generality. A three-part calibration analysis and qualitative case studies further highlight the interpretability and clinical relevance of uncertainty-aware forecasting.
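The Gaussian output head mentioned in the method trains by negative log-likelihood; a minimal version of that loss shows why it rewards calibrated uncertainty (the PHQ-8 framing is the paper's, the toy numbers are ours):

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of observation y under N(mu, sigma^2): the
    training loss for a probabilistic regression head that predicts both a
    PHQ-8 score estimate (mu) and its uncertainty (sigma)."""
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

# An overconfident wrong prediction is penalized far more heavily than an
# appropriately uncertain one, which is what calibrates the intervals.
overconfident = gaussian_nll(y=12.0, mu=6.0, sigma=0.5)
uncertain = gaussian_nll(y=12.0, mu=6.0, sigma=4.0)
print(overconfident > uncertain)  # True
```

The Student-t head mentioned in the method works the same way but with heavier tails, making the loss less sensitive to outlier scores.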

[33] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai

Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul

Main category: cs.CL

TL;DR: ThaiOCRBench is the first comprehensive benchmark for evaluating vision-language models on Thai text-rich visual understanding tasks, addressing the underrepresentation of Thai in existing benchmarks.

DetailsMotivation: Existing multimodal benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding.

Method: Created a diverse, human-annotated dataset of 2,808 samples across 13 task categories and evaluated state-of-the-art VLMs in zero-shot settings, including both proprietary and open-source systems.

Result: Proprietary models (e.g., Gemini 2.5 Pro) significantly outperform open-source counterparts, with fine-grained text recognition and handwritten content extraction showing the steepest performance drops among open-source models.

Conclusion: ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings and identifies key challenges like language bias, structural mismatch, and hallucinated content for improving Thai-language document understanding.

Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and offers actionable insights for improving Thai-language document understanding.

[34] RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables

Nikhil Abhyankar, Purvi Chaurasia, Sanchit Kabra, Ananya Srivastava, Vivek Gupta, Chandan K. Reddy

Main category: cs.CL

TL;DR: RUST-BENCH is a new benchmark with 7966 questions from 2031 real-world tables across science and sports domains, designed to test LLMs on complex tabular reasoning with heterogeneous, domain-specific data requiring multi-hop inference.

DetailsMotivation: Existing tabular reasoning benchmarks use small, uniform tables that don't represent real-world complexity, giving an incomplete view of LLMs' reasoning abilities on realistic data.

Method: Created RUST-BENCH with 7966 questions from 2031 real-world tables spanning NSF grant records (RB-Science) and NBA statistics (RB-Sports), evaluating LLMs across scale, heterogeneity, domain specificity, and reasoning complexity.

Result: LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies.

Conclusion: RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research by better representing real-world table complexity.

Abstract: Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models’ (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.

[35] OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation

Cuong Huynh, Jie Cao

Main category: cs.CL

TL;DR: The paper presents OUNLP system for TSAR-2025 Shared Task, using LLM-prompting for readability-controlled text simplification. It proposes multi-round simplification methods (MRS-Rule and MRS-Joint) based on finding that performance relates to CEFR level gap.

DetailsMotivation: The motivation stems from discovering that text simplification performance is highly related to the gap between source and target CEFR levels, leading to development of multi-round simplification approaches.

Method: Proposed two multi-round simplification methods using GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint), where MRS-Joint uses LLM simplified candidates as starting point.

Result: The submitted systems ranked 7th out of 20 teams, and later improvements showed that using LLM simplified candidates as the starting point in MRS-Joint could further boost multi-round simplification performance.

Conclusion: Multi-round simplification methods, particularly MRS-Joint that leverages LLM simplified candidates as starting points, effectively improve text simplification performance by addressing the CEFR level gap challenge.

Abstract: This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task (Alva-Manchego et al., 2025), designed for readability-controlled text simplification using LLM-prompting-based generation. Based on the analysis of prompt-based text simplification methods, we discovered an interesting finding that text simplification performance is highly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods, implemented via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7th out of 20 teams. Later improvements with MRS-Joint show that taking the LLM simplified candidates as the starting point could further boost the multi-round simplification performance.
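The multi-round idea, closing the source-to-target CEFR gap one simplification step at a time, can be sketched as a simple control loop. In the paper each round would be a GPT-4o call; here `simplify_once` and `estimate_level` are toy stand-ins invented for illustration.

```python
CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]

def multi_round_simplify(text, target_level, estimate_level, simplify_once,
                         max_rounds=5):
    """Repeatedly simplify until the estimated CEFR level is at or below
    the target, or the round budget runs out."""
    for _ in range(max_rounds):
        if CEFR.index(estimate_level(text)) <= CEFR.index(target_level):
            break
        text = simplify_once(text)
    return text

# Toy stand-ins: level is driven by the count of long "rare" words, and
# each round replaces one of them with a simple word.
def estimate_level(text):
    rare = sum(1 for w in text.split() if len(w) > 9)
    return CEFR[min(rare + 1, 5)]

def simplify_once(text):
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 9:
            words[i] = "simple"
            break
    return " ".join(words)

hard = "the notwithstanding clause permits extraordinary parliamentary derogations"
easy = multi_round_simplify(hard, "A2", estimate_level, simplify_once)
print(estimate_level(hard), "->", estimate_level(easy))  # C2 -> A2
```

MRS-Joint differs from this loop only in its initialization: it starts from an LLM-simplified candidate instead of the raw source text.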

[36] Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering

Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou

Main category: cs.CL

TL;DR: This paper systematically evaluates personality traits in six LLMs using the Big Five Inventory-2 framework, finding significant differences across personality dimensions and temperature sensitivity, with implications for model tuning and ethical AI governance.

DetailsMotivation: As LLMs become integral to human-centered applications, understanding their personality-like behaviors is important for responsible development and deployment.

Method: Applied the Big Five Inventory-2 (BFI-2) framework to systematically evaluate six LLMs under varying sampling temperatures, using hierarchical clustering to analyze trait patterns.

Result: Found significant differences across four of five personality dimensions, with Neuroticism and Extraversion particularly susceptible to temperature adjustments. Hierarchical clustering revealed distinct model clusters suggesting architectural features influence stable trait profiles.

Conclusion: Results provide new insights into personality-like patterns in LLMs and offer a new perspective on model tuning, selection, and ethical governance of AI systems.

Abstract: As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1

[37] RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere

Main category: cs.CL

TL;DR: RAGalyst is an automated, human-aligned agentic framework for evaluating domain-specific RAG systems, featuring synthetic QA dataset generation and optimized LLM-as-a-Judge metrics that correlate with human judgment.

DetailsMotivation: Existing RAG evaluation frameworks use heuristic metrics that fail to capture domain nuances or LLM-as-a-Judge approaches that lack validated human alignment, creating challenges for safety-critical domains.

Method: Uses an agentic pipeline to generate synthetic QA datasets from source documents with filtering for data fidelity, and refines Answer Correctness and Answerability metrics through prompt optimization to achieve human correlation.

Result: Performance varies significantly across domains (military, cybersecurity, bridge engineering) with no universally optimal configuration, and analysis reveals common causes of low Answer Correctness in RAG systems.

Conclusion: Systematic evaluation frameworks like RAGalyst are essential for practitioners to understand domain-specific trade-offs and make informed design choices for reliable RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works utilize LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis of the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our Github.

[38] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways

Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon, Thomas Demeester, Edward Choi

Main category: cs.CL

TL;DR: The paper introduces Lunguage++, an expanded uncertainty-aware benchmark for radiology reports that quantifies both explicit and implicit uncertainty to enable better automated analysis and clinical decision-making.

DetailsMotivation: Radiology reports contain uncertainty that hampers automated analysis: explicit uncertainty from hedging phrases and implicit uncertainty from omitted reasoning. Current rule-based systems are insufficient for quantifying these uncertainties.

Method: Two-part framework: (1) Quantify explicit uncertainty using LLM-based reference ranking of hedging phrases mapped to probability values, (2) Model implicit uncertainty through expansion framework that adds characteristic sub-findings from expert-defined diagnostic pathways for 14 common diagnoses.

Result: Created Lunguage++, an expanded uncertainty-aware version of the Lunguage benchmark with fine-grained structured radiology reports that enables uncertainty-aware image classification and faithful diagnostic reasoning.

Conclusion: The framework successfully addresses both types of uncertainty in radiology reports, providing an enriched resource for investigating clinical impact of diagnostic uncertainty and improving automated analysis capabilities.

Abstract: Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.
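
The explicit-uncertainty step can be pictured as a lookup from hedging phrase to probability value. The phrases and numbers below are invented placeholders; the paper instead builds an expert-validated, LLM-based reference ranking of common hedges.

```python
# Toy mapping from hedging phrases to probability values. These phrases
# and numbers are invented for illustration; the paper derives its
# ranking from expert-validated, LLM-based comparisons.
HEDGE_TO_PROB = {
    "consistent with": 0.9,
    "likely": 0.75,
    "possible": 0.5,
    "cannot exclude": 0.3,
    "unlikely": 0.1,
}

def finding_probability(sentence: str, default: float = 1.0) -> float:
    """Score a finding mention by its most uncertain matching hedge.

    A sentence with no hedge is treated as a confident assertion.
    """
    text = sentence.lower()
    matches = [p for phrase, p in HEDGE_TO_PROB.items() if phrase in text]
    return min(matches) if matches else default

print(finding_probability("Opacity likely represents pneumonia"))  # 0.75
print(finding_probability("No pleural effusion"))                  # 1.0
```

A real system would also have to disambiguate context-dependent hedges, which is exactly why the paper argues a flat rule table like this one is insufficient.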

[39] Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics

Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow

Main category: cs.CL

TL;DR: Language models implicitly represent alternate reasoning paths during generation, and hidden activations can predict and control model uncertainty in chain-of-thought reasoning.

DetailsMotivation: To understand whether language models represent the alternate reasoning paths they could take during text generation, and to quantify model uncertainty.

Method: Using hidden activations to control and predict language model uncertainty during chain-of-thought reasoning, measuring correlation between uncertainty and steering effectiveness.

Result: Found clear correlation between model uncertainty and steering effectiveness via activations; hidden activations can predict future outcome distributions.

Conclusion: Models implicitly represent possible reasoning paths, and activation interventions are most effective when models haven’t committed to final answers.

Abstract: When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model’s uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model – in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model’s future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
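
The claim that hidden activations encode the future outcome distribution can be sketched with a linear probe. The weights and activation vectors below are toy values, not anything fit to a real model; a real probe would be trained on many (activation, outcome) pairs.

```python
import math

# Toy linear probe on a mid-generation hidden state, predicting the
# probability that the model eventually answers "yes". All numbers
# below are invented for illustration.
def probe(activation, weights, bias=0.0):
    """Sigmoid probe over a hidden-state vector."""
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))

uncommitted = [0.1, -0.2, 0.05]  # early in the chain of thought
committed = [1.5, -0.1, 2.0]     # after the model settles on an answer
w = [0.8, 0.3, 1.1]
print(round(probe(uncommitted, w), 3))  # near 0.5: many paths still open
print(round(probe(committed, w), 3))    # near 1.0: outcome largely decided
```

The paper's finding maps onto this picture: steering interventions work best in the "near 0.5" regime, before the probe output saturates.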

[40] IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection

Kaveh Eskandari Miandoab, Katharine Kowalyshyn, Kabir Pamnani, Anesu Gavhera, Vasanth Sarathy, Matthias Scheutz

Main category: cs.CL

TL;DR: IntelliProof is an interactive system that uses LLMs to analyze argumentative essays by structuring them as argumentation graphs with claims as nodes, evidence as properties, and support/attack relations as edges.

DetailsMotivation: To bridge the gap between automated essay scoring systems and user understanding by providing interactive visualization and analysis of argumentative structure with human oversight.

Method: Uses LLMs to classify and score argument relations, visualizes essays as argumentation graphs, provides justifications for classifications, and offers quantitative coherence measures.

Result: Developed a working system that enables rapid exploration of argumentative quality while maintaining human oversight, with a live demo available.

Conclusion: IntelliProof successfully bridges structural semantics of argumentative essays with user understanding through interactive visualization and natural language tools.

Abstract: We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user’s understanding of a given text. A live demo of the system is available at https://intelliproof.vercel.app

[41] From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting

Cyril Vallez, Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

Main category: cs.CL

TL;DR: LLM-based coding assistants generate security vulnerabilities that persist in latest models, prompting development of new severity metrics (Prompt Exposure and Model Exposure) to prioritize mitigation.

DetailsMotivation: As LLM coding assistants become critical in software development, their generated bugs pose significant cybersecurity risks, and current benchmarks haven't effectively improved model security.

Method: Introduces two new metrics: Prompt Exposure (PE) accounting for vulnerability severity, generation chance, and prompt formulation; and Model Exposure (ME) score indicating overall vulnerability severity and prevalence.

Result: Even the latest open-weight models remain vulnerable to the earliest reported vulnerability scenarios in realistic settings, suggesting safety-functionality trade-offs have so far prevented effective patching.

Conclusion: New severity metrics are needed to prioritize mitigation of the most serious and prevalent LLM-generated vulnerabilities, as current approaches haven’t sufficiently improved model security.

Abstract: As the role of Large Language Models (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable to the earliest reported vulnerability scenarios in realistic use settings, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces vulnerable code generation: Prompt Exposure (PE). To encourage the mitigation of the most serious and prevalent vulnerabilities, we use PE to define the Model Exposure (ME) score, which indicates the severity and prevalence of vulnerabilities a model generates.
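
The abstract names PE's three ingredients but not its exact formula, so the multiplicative combination and summed aggregation below are assumptions sketched purely for illustration.

```python
# Hypothetical sketch of Prompt Exposure (PE) and Model Exposure (ME).
# The paper defines PE from vulnerability severity, generation chance,
# and prompt formulation; the multiplicative form and summed ME used
# here are illustrative assumptions, not the paper's definitions.
def prompt_exposure(severity: float, gen_chance: float,
                    prompt_weight: float) -> float:
    """PE for one (vulnerability, prompt) pair.

    severity:      e.g. a CVSS-like score scaled to [0, 1]
    gen_chance:    fraction of generations containing the vulnerability
    prompt_weight: how natural the inducing prompt is, in [0, 1]
    """
    return severity * gen_chance * prompt_weight

def model_exposure(pe_scores):
    """ME as an aggregate of per-scenario PE scores (sum is an assumption)."""
    return sum(pe_scores)

pes = [prompt_exposure(0.9, 0.5, 0.8), prompt_exposure(0.4, 0.9, 1.0)]
print(round(model_exposure(pes), 3))  # 0.72
```

Whatever the exact form, the intent is the same: a severe vulnerability that a model emits often, from a prompt a user would plausibly write, should dominate the mitigation queue.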

[42] BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering

Sadia Sultana, Saiyma Sittul Muna, Mosammat Zannatul Samarukh, Ajwad Abrar, Tareque Mohmud Chowdhury

Main category: cs.CL

TL;DR: This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical MCQ datasets, and benchmarks various RAG strategies for medical QA in low-resource languages.

DetailsMotivation: To address the challenge of developing accurate biomedical QA systems in low-resource languages like Bangla, which limits equitable access to reliable medical knowledge.

Method: Applied and benchmarked several RAG strategies including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning. Integrated a Bangla medical textbook corpus through OCR and implemented an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies.

Result: Agentic RAG achieved the highest accuracy of 89.54% with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality.

Conclusion: RAG-based methods show potential to enhance reliability and accessibility of Bangla medical QA, establishing a foundation for future multilingual medical AI research.

Abstract: Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy of 89.54% with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.

[43] When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection

Alamgir Munir Qazi, John P. McCrae, Jamal Abdul Nasir

Main category: cs.CL

TL;DR: DeReC is a lightweight fact verification framework that uses dense retrieval and classification to outperform LLM-based methods in efficiency and accuracy, reducing runtime by 92-95% while achieving better F1 scores.

DetailsMotivation: Current LLM-based fact verification systems face computational inefficiency and hallucination risks, making them impractical for real-world deployment despite their effectiveness.

Method: Combines dense retrieval with specialized classification using general-purpose text embeddings instead of autoregressive LLMs for fact verification.

Result: Achieved 65.58% F1 score on RAWFC (surpassing L-Defense’s 61.20%) with 95% runtime reduction on RAWFC (23m36s vs 454m12s) and 92% reduction on LIAR-RAW (134m14s vs 1692m23s).

Conclusion: Retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.

Abstract: The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
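
A minimal sketch of the retrieve-then-classify idea: embed the claim, fetch the nearest evidence by cosine similarity, and classify the pair. The toy vectors stand in for general-purpose text embeddings, and the similarity threshold stands in for DeReC's trained classification head.

```python
import math

# Toy retrieve-then-classify pipeline. Real systems use learned text
# embeddings and a trained classifier; the vectors and threshold here
# are illustrative stand-ins.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(claim_vec, evidence_vecs, k=1):
    """Return indices of the top-k most similar evidence vectors."""
    order = sorted(range(len(evidence_vecs)),
                   key=lambda i: cosine(claim_vec, evidence_vecs[i]),
                   reverse=True)
    return order[:k]

def classify(claim_vec, evidence_vec, threshold=0.8):
    """Toy verdict: 'supported' if the best evidence is close enough."""
    return "supported" if cosine(claim_vec, evidence_vec) >= threshold else "refuted"

claim = [0.9, 0.1, 0.0]
evidence = [[0.8, 0.2, 0.1], [0.0, 1.0, 0.0]]
best = retrieve(claim, evidence)[0]
print(best, classify(claim, evidence[best]))
```

The efficiency argument falls out of this structure: a single embedding pass plus a nearest-neighbor lookup replaces autoregressive rationale generation token by token.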

[44] Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

Mohammad Atif Quamar, Mohammad Areeb

Main category: cs.CL

TL;DR: LEASH is a training-free decoding algorithm that adaptively stops CoT rationale generation by monitoring entropy slope and logit margin, reducing tokens by 30-35% and latency by 27% at the cost of a 10 p.p. accuracy drop.

DetailsMotivation: Full-length CoT rationales are computationally wasteful, increasing token usage and latency without proportional benefits.

Method: LEASH monitors token-level entropy slope and top-logit margin improvement, stopping generation when both signals plateau indicating stable reasoning state.

Result: Reduces average token generation by 30-35% and latency by 27% across four models on GSM8K and AQuA-RAT, at the cost of a 10 p.p. accuracy drop relative to full CoT.

Conclusion: LEASH provides a model-agnostic, training-free alternative to full CoT decoding that substantially improves efficiency, trading some accuracy for large token and latency savings.

Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30–35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
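
The stopping rule can be sketched as follows. The window size, tolerance, and simulated signal traces are illustrative choices; the paper's exact plateau criterion may differ.

```python
import math

# Sketch of LEASH's stopping rule: track token-level entropy and the
# top-logit margin, and halt once both plateau over a sliding window.
# Window size and tolerance below are illustrative, not the paper's.
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_margin(logits):
    s = sorted(logits, reverse=True)
    return s[0] - s[1]

def should_stop(entropies, margins, window=3, tol=1e-2):
    """Stop once both signals change less than `tol` across the window."""
    if len(entropies) < window + 1:
        return False
    flat = lambda xs: max(xs[-window - 1:]) - min(xs[-window - 1:]) < tol
    return flat(entropies) and flat(margins)

# Simulated per-token traces: both signals settle after a few steps.
ent = [2.1, 1.4, 0.9, 0.705, 0.701, 0.703, 0.700]
mar = [0.2, 0.8, 1.5, 2.02, 2.023, 2.021, 2.025]
steps = [should_stop(ent[:i + 1], mar[:i + 1]) for i in range(len(ent))]
print(steps)  # → [False, False, False, False, False, False, True]
```

In a real decoder, `entropy` and `top_margin` would be evaluated on each step's next-token distribution; generation halts at the first `True`.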

[45] Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models

Ercong Nie, Shuzhou Yuan, Bolei Ma, Helmut Schmid, Michael Färber, Frauke Kreuter, Hinrich Schütze

Main category: cs.CL

TL;DR: Decomposed prompting for sequence labeling tasks improves multilingual linguistic structure analysis in LLMs by generating individual prompts for each token instead of using single text-to-text prompts.

DetailsMotivation: Current text-to-text prompting strategies struggle with maintaining output templates when probing multilingual knowledge of linguistic structure in LLMs, particularly for sequence labeling tasks.

Method: Introduces decomposed prompting approach that generates individual prompts for each token in input sentences, asking for linguistic labels, tested on Universal Dependencies POS tagging across 38 languages using English-centric and multilingual LLMs.

Result: Decomposed prompting outperforms iterative prompting baseline in both efficacy and efficiency under zero- and few-shot settings, and provides insights into transferability of linguistic knowledge via multilingual prompting.

Conclusion: The decomposed prompting method effectively addresses output template challenges in sequence labeling tasks and reveals valuable insights about multilingual knowledge transfer in LLMs.

Abstract: Probing the multilingual knowledge of linguistic structure in LLMs, often characterized as sequence labeling, faces challenges with maintaining output templates in current text-to-text prompting strategies. To solve this, we introduce a decomposed prompting approach for sequence labeling tasks. Diverging from the single text-to-text prompt, our prompt method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, using both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Moreover, our analysis of multilingual performance of English-centric LLMs yields insights into the transferability of linguistic knowledge via multilingual prompting.
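
The core idea is easy to sketch: one label query per token instead of a single sequence-to-sequence prompt. The template wording below is illustrative, not the paper's exact prompt.

```python
# Sketch of decomposed prompting for POS tagging: emit one label query
# per token rather than one text-to-text prompt for the whole sentence.
# The template wording is an illustrative stand-in.
def decomposed_prompts(sentence: str) -> list:
    template = ('Sentence: "{sent}"\n'
                'What is the part-of-speech tag of the word "{tok}"?')
    return [template.format(sent=sentence, tok=tok)
            for tok in sentence.split()]

prompts = decomposed_prompts("The cat sleeps")
print(len(prompts))  # 3
print(prompts[1])
```

Because each prompt asks for exactly one label, the output template cannot drift mid-sequence, which is the failure mode the paper attributes to single text-to-text prompts.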

[46] LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users

Elinor Poole-Dayan, Deb Roy, Jad Kabbara

Main category: cs.CL

TL;DR: LLM response quality varies with user traits: users with lower English proficiency, less education, or non-US origin experience more undesirable behaviors such as inaccuracies and refusals.

DetailsMotivation: To investigate how LLM response quality (accuracy, truthfulness, refusals) varies depending on user characteristics like English proficiency, education level, and country of origin.

Method: Extensive experimentation on three state-of-the-art LLMs using two datasets targeting truthfulness and factuality, analyzing responses across different user trait groups.

Result: Undesirable behaviors in LLMs occur disproportionately more for users with lower English proficiency, lower education status, and those originating from outside the US.

Conclusion: Current LLMs are unreliable information sources for vulnerable user groups, showing systematic bias in response quality based on user demographics.

Abstract: While state-of-the-art large language models (LLMs) have shown impressive performance on many tasks, there has been extensive research on undesirable model behavior such as hallucinations and bias. In this work, we investigate how the quality of LLM responses changes in terms of information accuracy, truthfulness, and refusals depending on three user traits: English proficiency, education level, and country of origin. We present extensive experimentation on three state-of-the-art LLMs and two different datasets targeting truthfulness and factuality. Our findings suggest that undesirable behaviors in state-of-the-art LLMs occur disproportionately more for users with lower English proficiency, of lower education status, and originating from outside the US, rendering these models unreliable sources of information for their most vulnerable users.

[47] Legal Fact Prediction (LFP)

Junkai Liu, Yujie Tong, Hui Huang, Bowen Zheng, Yiran Hu, Peicheng Wu, Chuan Xiao, Makoto Onizuka, Muyun Yang, Shuyuan Zheng

Main category: cs.CL

TL;DR: Proposes Legal Fact Prediction (LFP) as a new legal NLP task that predicts legal facts from evidence, enabling judgment prediction before legal facts are established.

DetailsMotivation: Existing legal judgment prediction relies on established legal facts, which are unavailable early in litigation, limiting practical applicability.

Method: Created LFPBench dataset and developed LFP approach that takes evidence as input to predict legal facts, enabling fact-based judgment prediction without ground-truth legal facts.

Result: Extensive experiments on LFPBench demonstrate effective LFP-empowered legal judgment prediction and identify promising research directions.

Conclusion: LFP addresses key limitation of fact-based legal judgment prediction and enables practical early-stage litigation applications.

Abstract: Legal judgment prediction (LJP), which enables litigants and their lawyers to forecast judgment outcomes and refine litigation strategies, has emerged as a crucial legal NLP task. Existing studies typically utilize legal facts, i.e., facts that have been established by evidence and determined by the judge, to predict the judgment. However, legal facts are often difficult to obtain in the early stages of litigation, significantly limiting the practical applicability of fact-based LJP. To address this limitation, we propose a novel legal NLP task: legal fact prediction (LFP), which takes the evidence submitted by litigants for trial as input to predict legal facts, thereby empowering fact-based LJP technologies to make predictions in the absence of ground-truth legal facts. We also propose the first benchmark dataset, LFPBench, for evaluating the LFP task. Our extensive experiments on LFPBench demonstrate the effectiveness of LFP-empowered LJP and highlight promising research directions for LFP.

[48] DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei

Main category: cs.CL

TL;DR: DAMRO is a training-free method that reduces object hallucination in LVLMs by filtering out high-attention outlier background tokens using ViT’s CLS token during decoding.

DetailsMotivation: LVLMs suffer from object hallucination due to attention mechanisms focusing on background tokens rather than referred objects, caused by inherent flaws in the visual encoder.

Method: Uses ViT’s classification token (CLS) to identify and filter high-attention outlier tokens in the background, then eliminates their influence during the decoding stage without requiring additional training.

Result: Significantly reduces hallucination across LVLMs (LLaVA-1.5, LLaVA-NeXT, InstructBLIP) on benchmarks including POPE, CHAIR, MME and GPT-4V Aided Evaluation.

Conclusion: DAMRO effectively alleviates object hallucination in LVLMs by addressing attention distribution issues without model retraining.

Abstract: Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute this unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize redundant information and generate object hallucinations. To address the issue, we propose DAMRO, a novel training-free strategy that Dives into the Attention Mechanism of LVLMs to Reduce Object Hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminate their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code is released at https://github.com/coder-gx/DAMRO.
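
The filtering step can be sketched as follows, assuming the CLS attention over image tokens is already available. The mean-plus-k-standard-deviations outlier rule is an illustrative choice, not the paper's exact criterion.

```python
# Sketch of DAMRO-style filtering: flag image tokens that receive
# anomalously high CLS attention (typically background outliers), zero
# them out, and renormalize so they cannot dominate decoding. The
# outlier rule (mean + k*std) is illustrative only.
def filter_outlier_tokens(cls_attn, k=2.0):
    n = len(cls_attn)
    mean = sum(cls_attn) / n
    std = (sum((a - mean) ** 2 for a in cls_attn) / n) ** 0.5
    keep = [0.0 if a > mean + k * std else a for a in cls_attn]
    total = sum(keep)
    return [a / total for a in keep]  # renormalize remaining attention

attn = [0.02, 0.03, 0.85, 0.02, 0.04, 0.04]  # one background outlier
print(filter_outlier_tokens(attn))
```

In the real method, the surviving attention pattern then shapes the LLM decoder's use of image tokens at generation time; no retraining is involved.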

[49] Who is the root in a syntactic dependency structure?

Ramon Ferrer-i-Cancho, Marta Arias

Main category: cs.CL

TL;DR: The paper addresses the challenge of determining edge direction in syntactic dependency trees by focusing on root vertex identification using centrality scores, showing that root vertices tend to have high centrality.

DetailsMotivation: Current unsupervised methods struggle with determining the correct direction of edges in syntactic dependency structures, indicating a lack of fundamental understanding of what constitutes a root vertex.

Method: The authors use an ensemble of centrality scores, both non-spatial (tree-based) and spatial (position-based), to test the hypothesis that root vertices are central in syntactic dependency structures.

Result: The study confirms that root vertices tend to have high centrality and that highly central vertices tend to be roots. The best performance in root identification comes from novel spatial scores that consider vertex positions and their neighbors.

Conclusion: The research provides theoretical and empirical foundations for understanding rootness from a network science perspective, offering insights toward a universal notion of what makes a vertex a root in syntactic structures.

Abstract: The syntactic structure of a sentence can be described as a tree that indicates the syntactic relationships between words. In spite of significant progress in unsupervised methods that retrieve the syntactic structure of sentences, guessing the right direction of edges is still a challenge. As in a syntactic dependency structure edges are oriented away from the root, the challenge of guessing the right direction can be reduced to finding an undirected tree and the root. The limited performance of current unsupervised methods demonstrates the lack of a proper understanding of what a root vertex is from first principles. We consider an ensemble of centrality scores, some that only take into account the free tree (non-spatial scores) and others that take into account the position of vertices (spatial scores). We test the hypothesis that the root vertex is an important or central vertex of the syntactic dependency structure. We confirm the hypothesis in the sense that root vertices tend to have high centrality and that vertices of high centrality tend to be roots. The best performance in guessing the root is achieved by novel scores that only take into account the position of a vertex and that of its neighbours. We provide theoretical and empirical foundations towards a universal notion of rootness from a network science perspective.
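
A minimal version of the centrality-based root guesser, using closeness centrality (one of the non-spatial scores the paper considers) on an invented dependency tree:

```python
from collections import deque

# Sketch of root identification by centrality: compute closeness
# centrality on the undirected dependency tree and pick the most
# central vertex. The toy tree below is invented for illustration.
def closeness(tree, v):
    """1 / (sum of BFS distances from v to every other vertex)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in tree[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return 1.0 / sum(dist.values())

def guess_root(tree):
    return max(tree, key=lambda v: closeness(tree, v))

# "The dog barked loudly yesterday": undirected dependency tree whose
# true root is the verb "barked".
tree = {
    "barked": ["dog", "loudly", "yesterday"],
    "dog": ["barked", "The"],
    "The": ["dog"],
    "loudly": ["barked"],
    "yesterday": ["barked"],
}
print(guess_root(tree))  # barked
```

The paper's best-performing spatial scores additionally use word positions, which a free-tree score like this one ignores.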

[50] KGGen: Extracting Knowledge Graphs from Plain Text with Language Models

Belinda Mo, Kyssen Yu, Joshua Kazdan, Joan Cabezas, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo

Main category: cs.CL

TL;DR: KGGen is a text-to-KG generator that uses language models to create high-quality knowledge graphs from plaintext, addressing data scarcity in KG foundation models by clustering related entities to reduce sparsity.

DetailsMotivation: Knowledge graph data is scarce - human-labeled KGs are limited while automatically extracted KGs have questionable quality, creating a fundamental challenge for building foundation models for KGs.

Method: Developed KGGen as a Python library that uses language models to extract knowledge graphs from plaintext, with entity clustering to reduce sparsity in the extracted graphs.

Result: KGGen demonstrates far superior performance against existing extractors on the MINE benchmark, which measures a KG extractor’s ability to produce useful graphs from plain text.

Conclusion: KGGen provides a solution to KG data scarcity by generating high-quality graphs from text, making it accessible as an open-source tool for the research community.

Abstract: Recent interest in building foundation models for KGs has highlighted a fundamental challenge: knowledge-graph data is relatively scarce. The best-known KGs are primarily human-labeled, created by pattern-matching, or extracted using early NLP techniques. While human-generated KGs are in short supply, automatically extracted KGs are of questionable quality. We present a solution to this data scarcity problem in the form of a text-to-KG generator (KGGen), a package that uses language models to create high-quality graphs from plaintext. Unlike other KG extractors, KGGen clusters related entities to reduce sparsity in extracted KGs. KGGen is available as a Python library (pip install kg-gen), making it accessible to everyone. Along with KGGen, we release the first benchmark, Measure of Information in Nodes and Edges (MINE), which tests an extractor’s ability to produce a useful KG from plain text. We benchmark our new tool against existing extractors and demonstrate far superior performance.
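
The entity-clustering step can be illustrated with simple string normalization and fuzzy matching; the names below are toy data, and KGGen's actual clustering (driven by language models) is more sophisticated than this stand-in.

```python
from difflib import SequenceMatcher

# Sketch of sparsity reduction by entity clustering: map near-duplicate
# entity strings to one canonical node so extracted triples share
# vertices instead of fragmenting the graph.
def norm(e):
    # token-sort normalization so "Curie, Marie" and "Marie Curie" align
    return " ".join(sorted(e.lower().replace(",", " ").split()))

def canonicalize(entities, threshold=0.8):
    canon = []     # representatives chosen so far
    mapping = {}   # raw string -> canonical string
    for e in entities:
        for rep in canon:
            if SequenceMatcher(None, norm(e), norm(rep)).ratio() >= threshold:
                mapping[e] = rep
                break
        else:
            canon.append(e)
            mapping[e] = e
    return mapping

triples = [("Marie Curie", "won", "Nobel Prize"),
           ("Curie, Marie", "born_in", "Warsaw")]
mapping = canonicalize([h for h, _, _ in triples])
merged = [(mapping[h], r, t) for h, r, t in triples]
print(merged)  # both heads merge to "Marie Curie"
```

Without this merge, the two triples would form two disconnected one-edge graphs; with it, they share a single entity node.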

[51] Pragmatic Reasoning improves LLM Code Generation

Zhuchen Cao, Sven Apel, Adish Singla, Vera Demberg

Main category: cs.CL

TL;DR: CodeRSA is a novel code candidate reranking mechanism based on the Rational Speech Act framework that improves LLM code generation by better understanding user intent through pragmatic reasoning.

DetailsMotivation: User instructions often contain ambiguities that make it challenging for LLMs to generate code that accurately reflects the user's true intent, despite existing approaches that produce multiple candidates and rerank them.

Method: Proposed CodeRSA, a code candidate reranking mechanism built upon the Rational Speech Act (RSA) framework to guide LLMs toward more comprehensive pragmatic reasoning about user intent.

Result: CodeRSA consistently outperforms common baselines, surpasses state-of-the-art approaches in most cases, and demonstrates robust overall performance on HumanEval and MBPP benchmarks using Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct.

Conclusion: Integrating pragmatic reasoning into code candidate reranking is effective for enhancing code generation quality in LLMs, offering a promising direction for future improvements.

Abstract: Large Language Models (LLMs) have demonstrated impressive potential in translating natural language (NL) instructions into program code. However, user instructions often contain inherent ambiguities, making it challenging for LLMs to generate code that accurately reflects the user’s true intent. To address this challenge, researchers have proposed approaches that produce multiple candidates of the program code and then rerank them to identify the best solution. In this paper, we propose CodeRSA, a novel code candidate reranking mechanism built upon the Rational Speech Act (RSA) framework, designed to guide LLMs toward more comprehensive pragmatic reasoning about user intent. We evaluate CodeRSA using Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct on two widely used code generation benchmarks, HumanEval and MBPP. Our experiment results show that CodeRSA consistently outperforms common baselines, surpasses the state-of-the-art approach in most cases, and demonstrates robust overall performance. These findings underscore the effectiveness of integrating pragmatic reasoning into code candidate reranking, offering a promising direction for enhancing code generation quality in LLMs.
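
The RSA recursion behind this kind of reranking can be sketched numerically. The literal log-probabilities and the single alternative instruction below are toy inputs, and the uniform-prior pragmatic listener is a generic RSA formulation rather than CodeRSA's exact design.

```python
import math

# Generic RSA reranking sketch: rescore each code candidate by how
# likely a cooperative speaker would have chosen THIS instruction to
# describe it, not just how likely the code is given the instruction.
# Literal scores below are toy numbers, not real LLM likelihoods.
def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def rsa_rerank(literal, alpha=1.0):
    """literal[i][c] = log P(code c | instruction i); row 0 is the
    user's actual instruction, other rows are alternatives."""
    n_codes = len(literal[0])
    # Speaker: for each code, a distribution over instructions.
    speaker = [softmax([alpha * literal[i][c] for i in range(len(literal))])
               for c in range(n_codes)]
    # Pragmatic listener for the actual instruction, uniform code prior.
    scores = [speaker[c][0] for c in range(n_codes)]
    z = sum(scores)
    return [s / z for s in scores]

# Code 0 scores higher literally, but it is also highly compatible with
# the alternative instruction, so pragmatic reasoning favors code 1.
literal = [[-1.0, -1.2],   # actual instruction
           [-0.1, -3.0]]   # alternative instruction
print(rsa_rerank(literal))
```

This captures the pragmatic intuition: a candidate that a speaker would more plausibly have described with the user's exact wording is preferred, even if its literal likelihood is slightly lower.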

[52] GraphCheck: Multipath Fact-Checking with Entity-Relationship Graphs

Hyewon Jeon, Jay-Yoon Lee

Main category: cs.CL

TL;DR: GraphCheck transforms claims into entity-relationship graphs for systematic fact-checking, with DP-GraphCheck adding adaptive strategy selection between direct prompting and graph-based reasoning.

Motivation: Verifying complex claims requiring multi-hop reasoning remains challenging in automated fact-checking, necessitating more structured and robust approaches.

Method: Proposes GraphCheck framework that converts claims into entity-relationship graphs to model explicit/latent entities and explore multiple reasoning paths, plus DP-GraphCheck variant with lightweight strategy selector for adaptive reasoning.

Result: Outperforms existing methods on HOVER and EX-FEVER datasets in verification accuracy while maintaining computational efficiency; strategy selection generalizes well to other pipelines.

Conclusion: The framework provides robust verification for complex claims through structured graph-based reasoning, with adaptive strategy selection balancing accuracy and efficiency across different claim complexities.

Abstract: Automated fact-checking aims to assess the truthfulness of textual claims based on relevant evidence. However, verifying complex claims that require multi-hop reasoning remains a significant challenge. We propose GraphCheck, a novel framework that transforms claims into entity-relationship graphs for structured and systematic fact-checking. By explicitly modeling both explicit and latent entities and exploring multiple reasoning paths, GraphCheck enhances verification robustness. While GraphCheck excels in complex scenarios, it may be unnecessarily elaborate for simpler claims. To address this, we introduce DP-GraphCheck, a variant that employs a lightweight strategy selector to choose between direct prompting and GraphCheck adaptively. This selective mechanism improves both accuracy and efficiency by applying the appropriate level of reasoning to each claim. Experiments on the HOVER and EX-FEVER datasets demonstrate that our approach outperforms existing methods in verification accuracy, while achieving strong computational efficiency despite its multipath exploration. Moreover, the strategy selection mechanism in DP-GraphCheck generalizes well to other fact-checking pipelines, highlighting the broad applicability of our framework.
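As a rough illustration of the graph-based idea (a hypothetical mini-example; GraphCheck's real pipeline extracts entities and relations with an LLM), a claim can be decomposed into triples whose verification orderings form the multiple reasoning paths:

```python
from itertools import permutations

# Hypothetical triples for a multi-hop claim; in GraphCheck these are
# extracted from the claim text, including latent (implicit) entities.
claim_triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]

def reasoning_paths(triples):
    """Each order in which the triples can be checked against evidence is
    one candidate reasoning path (the 'multipath' in multipath
    fact-checking); a verifier would score each path and aggregate."""
    return list(permutations(triples))
```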

[53] Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter

Main category: cs.CL

TL;DR: Repeating aggressively filtered datasets for multiple epochs can outperform training on larger datasets for single epoch across various compute budgets, with document-level repetition optimization further improving performance.

Motivation: As LLM compute budgets grow, limited data volume from heavily filtered datasets becomes a constraint, requiring better understanding of how to optimize data usage.

Method: Study model performance across compute budgets and pre-training datasets, modify training recipes for dataset repetition, and investigate document-level repetition manipulation.

Result: Repeating filtered datasets for up to 10 epochs outperforms single-epoch training on 10x larger supersets; document-level repetition optimization creates better datasets per token budget.

Conclusion: Data filtering remains crucial research direction even as LLMs scale, with dataset repetition and document-level optimization providing effective performance improvements.

Abstract: Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.
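The document-level manipulation can be pictured as assembling a training mix from explicit per-document repeat counts under a token budget (a toy sketch of the idea, not the paper's actual recipe):

```python
def build_corpus(docs, counts, token_budget):
    """Toy document-level repetition.

    docs:   {doc_id: token_count}
    counts: {doc_id: how many times to repeat that document}
    Emits doc ids in order, stopping at the first document that would
    overflow the budget (a simplification for illustration).
    """
    corpus, used = [], 0
    for doc_id, n_tokens in docs.items():
        for _ in range(counts.get(doc_id, 1)):
            if used + n_tokens > token_budget:
                return corpus, used
            corpus.append(doc_id)
            used += n_tokens
    return corpus, used
```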

[54] Efficient Model Development through Fine-tuning Transfer

Pin-Jie Lin, Rishab Balasubramanian, Fengyuan Liu, Nikhil Kandpal, Tu Vu

Main category: cs.CL

TL;DR: Transferring fine-tuning updates (diff vectors) between different model versions can significantly improve target model performance without additional training, offering a cost-efficient strategy for continuous LLM development.

Motivation: Modern LLMs require expensive alignment processes to be repeated for each new model version, creating inefficiency in model updates and domain-specific adaptations.

Method: Extract diff vectors (weight changes from fine-tuning) from source models and apply them to base models of different target versions, leveraging linear connectivity in parameter space.

Result: Transferred updates improved Llama 3.1 8B by 46.9% on IFEval and 15.7% on LiveCodeBench, surpassing Llama 3.1 8B Instruct. Also achieved 4.7-15.5% improvements on multilingual tasks and provided better initializations for further fine-tuning.

Conclusion: Fine-tuning transfer is an effective, cost-efficient approach for continuous LLM development, especially when source and target models are linearly connected in parameter space.

Abstract: Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector (representing the weight changes from fine-tuning) from one source model version and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the performance of the target base model. For example, transferring the fine-tuning updates from Llama 3.0 8B improves Llama 3.1 8B by 46.9% on IFEval and 15.7% on LiveCodeBench without additional training, even surpassing Llama 3.1 8B Instruct. Furthermore, we demonstrate performance gains on multilingual tasks, with 4.7% and 15.5% improvements on Global MMLU for Malagasy and Turkish, respectively. We observe that these merged models provide stronger initializations for further fine-tuning. Lastly, our controlled experiments suggest that fine-tuning transfer is most effective when source and target models lie in a linearly connected region of parameter space, and we provide a theoretical analysis of our method. Taken together, fine-tuning transfer offers a cost-efficient and practical strategy for continuous LLM development. Our code is available at github.com/pjlintw/finetuning-transfer.
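The diff-vector recipe from the abstract is simple enough to sketch directly (toy numpy weights for illustration; real transfers operate on full checkpoints whose parameter shapes match across versions):

```python
import numpy as np

def transfer_finetuning(source_base, source_ft, target_base):
    """Apply a fine-tuning diff vector to a different base version.

    diff = source_ft - source_base captures the fine-tuning update;
    adding it to the target base transfers the behaviour without further
    training. Weights here are toy numpy arrays keyed by parameter name.
    """
    return {
        name: target_base[name] + (source_ft[name] - source_base[name])
        for name in target_base
    }
```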

Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

Main category: cs.CL

TL;DR: This paper introduces TathyaNyaya, the largest annotated dataset for Fact-based Judgment Prediction and Explanation (FJPE) in the Indian legal context, and FactLegalLlama, an instruction-tuned LLM optimized for generating high-quality explanations in FJPE tasks.

Motivation: To develop robust and realistic AI-driven decision-making tools for legal systems by focusing on factual data rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes.

Method: Created TathyaNyaya dataset from Indian Supreme Court and High Court judgments, focusing on factual statements. Developed FactLegalLlama by instruction-tuning LLaMa-3-8B on the factual data. Combined transformers for binary judgment prediction with FactLegalLlama for explanation generation.

Result: TathyaNyaya surpasses existing datasets in scale and diversity. The framework enhances predictive accuracy with coherent, contextually relevant explanations, addressing transparency and interpretability needs in AI-assisted legal systems.

Conclusion: Factual precision and domain-specific tuning are crucial for enhancing predictive performance and interpretability in AI-assisted legal decision-making. TathyaNyaya and FactLegalLlama establish benchmarks for explainable AI systems in legal analysis.

Abstract: In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms “Tathya” (fact) and “Nyaya” (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.

[56] Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Manveer Singh Tamber, Forrest Sheng Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, Jimmy Lin

Main category: cs.CL

TL;DR: Vectara introduces FaithJudge, an LLM-as-a-judge framework for improved hallucination detection in RAG systems, and launches an enhanced hallucination leaderboard to benchmark LLM faithfulness across multiple tasks.

Motivation: Current RAG systems still suffer from hallucinations where LLMs introduce unsupported information or contradictions even when provided with relevant context, highlighting the need for better hallucination detection and benchmarking methods.

Method: Developed FaithJudge framework using LLM-as-a-judge approach with diverse human-annotated hallucination examples, and created enhanced hallucination leaderboard tracking LLM performance on summarization, QA, and data-to-text generation tasks.

Result: FaithJudge substantially improves automated hallucination evaluation of LLMs and enables more reliable benchmarking of LLM hallucinations in RAG systems.

Conclusion: FaithJudge supports development of more trustworthy generative AI systems by providing better hallucination detection and benchmarking capabilities for RAG applications.

Abstract: Retrieval-augmented generation (RAG) aims to reduce hallucinations by grounding responses in external context, yet large language models (LLMs) still frequently introduce unsupported information or contradictions even when provided with relevant context. This paper presents two complementary efforts at Vectara to measure and benchmark LLM faithfulness in RAG. First, we describe our original hallucination leaderboard, which has tracked hallucination rates for LLMs since 2023 using our HHEM hallucination detection model. Motivated by limitations observed in current hallucination detection methods, we introduce FaithJudge, an LLM-as-a-judge framework that leverages a pool of diverse human-annotated hallucination examples to substantially improve the automated hallucination evaluation of LLMs. We introduce an enhanced hallucination leaderboard centered on FaithJudge that benchmarks LLMs on RAG faithfulness in summarization, question-answering, and data-to-text generation tasks. FaithJudge enables a more reliable benchmarking of LLM hallucinations in RAG and supports the development of more trustworthy generative AI systems: https://github.com/vectara/FaithJudge.

[57] What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization

Weixiao Zhou, Junnan Zhu, Gengyao Li, Xianfu Cheng, Xinnian Liang, Feifei Zhai, Zhoujun Li

Main category: cs.CL

TL;DR: Introduces Knowledge-Grounded Discussion Summarization (KGDS) to address limitations of traditional dialogue summarization by generating background context summaries alongside opinion summaries with clarified references.

Motivation: Traditional dialogue summarization fails when discussions rely on shared background knowledge, leading to confusing summaries for unfamiliar readers due to omitted context and implicit references.

Method: Created first KGDS benchmark with news-discussion pairs and expert annotations, proposed hierarchical evaluation framework with fine-grained metrics, evaluated 12 advanced LLMs.

Result: KGDS remains challenging - models miss key facts and retain irrelevant ones in background summaries, fail to resolve implicit references in opinion summaries.

Conclusion: Knowledge-grounded summarization is a significant unsolved problem requiring better context understanding and reference resolution capabilities.

Abstract: Traditional dialogue summarization primarily focuses on dialogue content, assuming it comprises adequate information for a clear summary. However, this assumption often fails for discussions grounded in shared background, where participants frequently omit context and use implicit references. This results in summaries that are confusing to readers unfamiliar with the background. To address this, we introduce Knowledge-Grounded Discussion Summarization (KGDS), a novel task that produces a supplementary background summary for context and a clear opinion summary with clarified references. To facilitate research, we construct the first KGDS benchmark, featuring news-discussion pairs and expert-created multi-granularity gold annotations for evaluating sub-summaries. We also propose a novel hierarchical evaluation framework with fine-grained and interpretable metrics. Our extensive evaluation of 12 advanced large language models (LLMs) reveals that KGDS remains a significant challenge. The models frequently miss key facts and retain irrelevant ones in background summarization, and often fail to resolve implicit references in opinion summary integration.

[58] On Multilingual Encoder Language Model Compression for Low-Resource Languages

Daniil Gurgurov, Michal Gregor, Josef van Genabith, Simon Ostermann

Main category: cs.CL

TL;DR: Combines knowledge distillation, pruning, truncation, and vocabulary trimming to compress multilingual encoder models for low-resource languages by up to 92% with moderate performance drops.

Motivation: To create significantly smaller monolingual models for low-resource languages while retaining essential language-specific knowledge through extreme compression techniques.

Method: Systematically combines two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to reduce layer depth, feed-forward hidden size, and intermediate layer embedding size.

Result: Achieved compression rates up to 92% with average performance drops of 2-10% for moderate compression and 8-13% for maximum compression across sentiment analysis, topic classification, NER, and POS tagging in three low-resource languages.

Conclusion: Performance degradation correlates with language-specific data in teacher models, with larger datasets resulting in smaller losses; ablation studies identified best practices for multilingual model compression.

Abstract: In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% while maintaining competitive performance, with average drops of 2-10% for moderate compression and 8-13% at maximum compression in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct ablation studies to identify the best practices for multilingual model compression using these techniques.
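One of the combined techniques, vocabulary trimming, is easy to illustrate: keep only the embedding rows for tokens that actually occur in the target-language corpus (a minimal sketch of the general idea, not the paper's exact procedure):

```python
import numpy as np

def trim_vocabulary(embeddings, vocab, corpus_tokens):
    """Shrink an embedding matrix to the tokens seen in a target corpus.

    embeddings: (V, d) matrix; vocab: list of V token strings;
    corpus_tokens: set of tokens observed in the target-language corpus.
    Returns the trimmed matrix, the trimmed vocab, and a remapped id table.
    """
    kept = [i for i, tok in enumerate(vocab) if tok in corpus_tokens]
    new_vocab = [vocab[i] for i in kept]
    new_embeddings = embeddings[kept]
    new_ids = {tok: j for j, tok in enumerate(new_vocab)}
    return new_embeddings, new_vocab, new_ids
```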

[59] Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion

Jianxiang Zang, Meiling Ning, Yongda Wei, Shihan Dou, Jiazheng Zhang, Nijia Mo, Binhong Li, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: The paper identifies ‘compression hacking’ in language models where high compression rates create anisotropic representations that hurt performance. It proposes refined compression metrics with geometric distortion analysis that better correlate with LM capabilities.

Motivation: Current 'compression as intelligence' metrics can be misleading because highly compressed LMs develop anisotropic representations that actually hinder performance, creating a 'compression hacking' phenomenon.

Method: Proposed three refined compression metrics incorporating geometric distortion analysis and integrated them into a self-evaluation pipeline to detect and correct for compression hacking.

Result: The refined metrics achieved Spearman correlation coefficients above 0.9 with LM comprehensive capabilities, significantly outperforming original compression metrics and other internal structure-based metrics.

Conclusion: Incorporating geometric distortion analysis into compression metrics substantially enhances the informatics interpretation of language models by addressing the compression hacking problem.

Abstract: Recently, the concept of "compression as intelligence" has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially the "Compression Hacking" in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM's comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that incorporating geometric distortion of representations substantially enhances the informatics interpretation of LMs by correcting for compression hacking.
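The anisotropy the paper ties to compression hacking is commonly proxied by the mean pairwise cosine similarity of representations; a sketch of that standard measure (not the paper's refined metrics) looks like:

```python
import numpy as np

def anisotropy(reps):
    """Mean pairwise cosine similarity over off-diagonal pairs.

    reps: (n, d) matrix of representations. Values near 1 mean the
    vectors crowd into a narrow cone (highly anisotropic); values near 0
    mean the space is used uniformly.
    """
    x = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(reps)
    # Subtract the n self-similarities on the diagonal, then average
    # over the n*(n-1) remaining ordered pairs.
    return (sims.sum() - n) / (n * (n - 1))
```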

[60] Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

Pankaj Kumar, Subhankar Mishra

Main category: cs.CL

TL;DR: This survey provides a comprehensive overview of robustness in Large Language Models (LLMs), examining conceptual foundations, sources of non-robustness, mitigation strategies, evaluation benchmarks, and future research directions.

Motivation: Ensuring the robustness of LLMs remains a critical challenge despite their emergence as promising foundations for NLP and AI. The survey aims to address these challenges and advance the field by systematically examining robustness issues.

Method: The survey systematically examines the nature of robustness in LLMs, analyzes sources of non-robustness (intrinsic model limitations, data-driven vulnerabilities, external adversarial factors), reviews state-of-the-art mitigation strategies, and discusses evaluation benchmarks and metrics.

Result: The paper synthesizes findings from existing surveys and interdisciplinary studies to provide a comprehensive understanding of LLM robustness challenges, current mitigation approaches, and evaluation methodologies.

Conclusion: The survey highlights trends, unresolved issues, and pathways for future research in LLM robustness, emphasizing the importance of consistent performance across diverse inputs and addressing real-world reliability gaps.

Abstract: Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.

[61] Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models

Junyi Li, Hwee Tou Ng

Main category: cs.CL

TL;DR: RL fine-tuning for reasoning in LLMs increases hallucinations due to training dynamics. FSPO addresses this by incorporating step-wise factuality verification to reduce hallucinations while improving reasoning accuracy.

Motivation: While RL fine-tuning improves LLM reasoning capabilities, it significantly increases hallucinations, creating a reliability issue that needs to be addressed.

Method: Proposed FSPO (Factuality-aware Step-wise Policy Optimization) - an RL algorithm that incorporates explicit factuality verification at each reasoning step using automated verification against evidence to adjust token-level advantage values.

Result: Experiments on mathematical reasoning and hallucination benchmarks with Qwen2.5 and Llama models show FSPO effectively reduces hallucinations while enhancing reasoning accuracy, improving both reliability and performance.

Conclusion: FSPO successfully addresses the hallucination problem in RL-fine-tuned reasoning models by integrating step-wise factuality verification, achieving better balance between reasoning capability and factual reliability.

Abstract: Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradient, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.
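FSPO's step-wise adjustment can be caricatured as shaping token-level advantages with a per-step factuality signal (an assumed additive rule for illustration only; the paper's exact formulation may differ):

```python
def factuality_adjusted_advantages(advantages, step_spans, verified, beta=0.5):
    """Toy sketch of step-wise factuality shaping.

    advantages: per-token advantage values from the RL algorithm.
    step_spans: (start, end) token index range of each reasoning step.
    verified:   per-step booleans from automated verification against
                evidence. Tokens in a verified step get a bonus, tokens
                in a refuted step a penalty (beta is a made-up weight).
    """
    adjusted = list(advantages)
    for (start, end), ok in zip(step_spans, verified):
        delta = beta if ok else -beta
        for t in range(start, end):
            adjusted[t] = advantages[t] + delta
    return adjusted
```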

[62] Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs

Wanyun Cui, Mingwei Xu

Main category: cs.CL

TL;DR: AsymKV is a training-free KV cache compression framework that leverages key-value asymmetry - keys show local homogeneity while values are heterogeneous - to achieve superior long-context performance through key merging and lossless value compression.

Motivation: The quadratic complexity of attention mechanisms in LLMs creates challenges for long-context modeling. Existing KV cache compression methods treat keys and values uniformly, overlooking fundamental asymmetry between them.

Method: Proposes AsymKV framework that combines homogeneity-based key merging with mathematically proven lossless value compression, exploiting the observed asymmetry where adjacent keys have similar attention weights but adjacent values have heterogeneous distributions.

Result: AsymKV consistently outperforms existing long-context methods across various tasks and base models. On LLaMA3.1-8B, it achieves 43.95 average score on LongBench, significantly surpassing SOTA methods like H2O (38.89).

Conclusion: The discovered key-value asymmetry enables more effective KV cache compression, and AsymKV provides a training-free solution that significantly improves long-context modeling efficiency and performance.

Abstract: Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights (local homogeneity), adjacent values demonstrate distinct heterogeneous distributions. This key-value asymmetry reveals a critical limitation in existing compression methods that treat keys and values uniformly. To address the limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H2O (38.89) by a large margin. Our code can be found at https://github.com/the-scale-lab/Asymkv.
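The key-merging half of the asymmetry can be sketched as averaging runs of adjacent keys whose cosine similarity exceeds a threshold (a toy illustration inspired by the abstract, not the released AsymKV code):

```python
import numpy as np

def merge_adjacent_keys(keys, threshold=0.95):
    """Homogeneity-based key merging, toy version.

    keys: (n, d) array of cached keys. Adjacent keys whose cosine
    similarity to the current group's running mean exceeds `threshold`
    are folded into one merged slot; `groups` records which original
    positions each merged key covers, so values can stay uncompressed.
    """
    merged, groups = [], []
    for i, k in enumerate(keys):
        if merged:
            m = merged[-1]
            cos = np.dot(m, k) / (np.linalg.norm(m) * np.linalg.norm(k))
            if cos > threshold:
                groups[-1].append(i)
                merged[-1] = keys[groups[-1]].mean(axis=0)
                continue
        merged.append(k.astype(float))
        groups.append([i])
    return np.array(merged), groups
```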

[63] TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs

Ezgi Başar, Francesca Padovani, Jaap Jumelet, Arianna Bisazza

Main category: cs.CL

TL;DR: TurBLiMP is the first Turkish benchmark for evaluating language models using 16,000 minimal pairs across 16 linguistic phenomena, with special focus on Turkish word order flexibility and morphological subordination.

Motivation: To fill the gap in linguistic evaluation resources for Turkish and specifically study word order flexibility and morphological subordination, which are understudied in current LM evaluations.

Method: Created a benchmark with 16,000 minimal pairs (1000 per phenomenon) covering 16 linguistic phenomena, with extra attention to Turkish-specific features. Evaluated various LMs and collected human acceptability judgments for comparison.

Result: Large language models still struggle with grammatical phenomena that are easy for humans, and show different sensitivities to word order and morphological complexity compared to human judgments.

Conclusion: Even state-of-the-art language models have limitations in handling Turkish-specific linguistic phenomena, particularly word order flexibility and morphological complexity, highlighting the need for language-specific evaluation benchmarks.

Abstract: We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
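Minimal-pair evaluation itself follows a standard protocol: the model is credited when it scores the grammatical sentence above its ungrammatical twin. A sketch with an arbitrary stand-in scorer (any summed token log-likelihood would do in practice):

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of minimal pairs where the grammatical sentence wins.

    pairs: iterable of (grammatical, ungrammatical) sentence pairs.
    score: any sentence scoring function, e.g. an LM's summed token
           log-probability (a stand-in is used in the example below).
    """
    pairs = list(pairs)
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)
```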

[64] FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models’ Knowledge and Reasoning

Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang

Main category: cs.CL

TL;DR: FinEval-KR is a novel evaluation framework that decouples and quantifies LLMs’ knowledge and reasoning abilities in financial tasks, revealing that reasoning ability and higher-order cognitive skills are key factors for accuracy, with specialized financial LLMs generally lagging behind top general models.

Motivation: Current evaluation benchmarks for LLMs in financial reasoning fail to decouple knowledge and reasoning capabilities and lack root cause analysis for task failures, making it difficult to understand the core factors affecting performance.

Method: Introduced FinEval-KR framework with distinct knowledge and reasoning score metrics, plus a cognitive score based on Bloom’s taxonomy. Released a new open-source Chinese financial reasoning dataset covering 22 subfields for reproducible research.

Result: Experimental results show LLM reasoning ability and higher-order cognitive ability are core factors influencing reasoning accuracy. Top models still face bottlenecks with knowledge application, and specialized financial LLMs generally lag behind top general large models.

Conclusion: The study highlights the importance of decoupling knowledge and reasoning capabilities in LLM evaluation, reveals critical performance gaps in financial reasoning tasks, and provides a framework and dataset for advancing financial reasoning research.

Abstract: Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks requiring both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short by not decoupling these capability indicators from single-task performance and by lacking root-cause analysis for task failures. To address this, we introduce FinEval-KR, a novel evaluation framework for decoupling and quantifying LLMs’ knowledge and reasoning abilities independently, proposing distinct knowledge score and reasoning score metrics. Inspired by cognitive science, we further propose a cognitive score based on Bloom’s taxonomy to analyze capabilities in reasoning tasks across different cognitive levels. We also release a new open-source Chinese financial reasoning dataset covering 22 subfields to support reproducible research and further advancements in financial reasoning. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy. We also specifically find that even top models still face a bottleneck with knowledge application. Furthermore, our analysis shows that specialized financial LLMs generally lag behind the top general large models across multiple metrics.
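
The paper defines its own knowledge and reasoning score metrics; as a rough illustration of what decoupling the two could look like, the sketch below scores knowledge recall and reasoning accuracy separately, restricting the reasoning score to items where the required knowledge was actually available. The item fields are hypothetical, not FinEval-KR's actual schema.

```python
def knowledge_score(items):
    # Fraction of required knowledge points the model recalled correctly.
    recalled = sum(it["knowledge_recalled"] for it in items)
    required = sum(it["knowledge_required"] for it in items)
    return recalled / required

def reasoning_score(items):
    # Accuracy restricted to items where the needed knowledge was fully
    # recalled, so reasoning failures are not confounded with knowledge gaps.
    usable = [it for it in items if it["knowledge_recalled"] == it["knowledge_required"]]
    return sum(it["answer_correct"] for it in usable) / len(usable)

# Fabricated evaluation items for illustration.
items = [
    {"knowledge_required": 2, "knowledge_recalled": 2, "answer_correct": True},
    {"knowledge_required": 3, "knowledge_recalled": 1, "answer_correct": False},
    {"knowledge_required": 1, "knowledge_recalled": 1, "answer_correct": False},
]
k = knowledge_score(items)   # 4/6: knowledge gaps on the second item
r = reasoning_score(items)   # 1/2: one reasoning failure despite full knowledge
```

The separation makes the paper's diagnosis expressible: a model can have a high knowledge score and still lose accuracy to reasoning failures, or vice versa.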

[65] Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries

Zhengren Wang, Dongwen Yao, Bozhou Li, Dongsheng Ma, Bo Li, Zhiyu Li, Feiyu Xiong, Bin Cui, Linpeng Tang, Wentao Zhang

Main category: cs.CL

TL;DR: Text2VectorSQL is introduced as a unified natural language interface for querying both structured and unstructured data, addressing limitations of Text-to-SQL and vector search.

DetailsMotivation: Traditional database interfaces struggle with unstructured data, while Text-to-SQL can't handle semantic/multi-modal queries and vector search lacks standardized evaluation methods when integrated with SQL.

Method: Created a comprehensive ecosystem including: scalable pipeline for synthesizing training data, VectorSQLBench benchmark (12 combinations across 3 database backends and 4 data sources), and novel evaluation metrics.

Result: Experiments show strong baseline performance but reveal the recall degradation challenge: integrating SQL filters with vector search causes more result omissions than conventional filtered vector search.

Conclusion: This work lays essential groundwork for next-generation unified data interfaces by defining the core task, providing infrastructure, and identifying key research challenges.

Abstract: The proliferation of unstructured data poses a fundamental challenge to traditional database interfaces. While Text-to-SQL has democratized access to structured data, it remains incapable of interpreting semantic or multi-modal queries. Concurrently, vector search has emerged as the de facto standard for querying unstructured data, but its integration with SQL-termed VectorSQL-still relies on manual query crafting and lacks standardized evaluation methodologies, creating a significant gap between its potential and practical application. To bridge this fundamental gap, we introduce and formalize Text2VectorSQL, a novel task to establish a unified natural language interface for seamlessly querying both structured and unstructured data. To catalyze research in this new domain, we present a comprehensive foundational ecosystem, including: (1) A scalable and robust pipeline for synthesizing high-quality Text-to-VectorSQL training data. (2) VectorSQLBench, the first large-scale, multi-faceted benchmark for this task, encompassing 12 distinct combinations across three database backends (SQLite, PostgreSQL, ClickHouse) and four data sources (BIRD, Spider, arXiv, Wikipedia). (3) Several novel evaluation metrics designed for more nuanced performance analysis. Extensive experiments not only confirm strong baseline performance with our trained models, but also reveal the recall degradation challenge: the integration of SQL filters with vector search can lead to more pronounced result omissions than in conventional filtered vector search. By defining the core task, delivering the essential data and evaluation infrastructure, and identifying key research challenges, our work lays the essential groundwork to build the next generation of unified and intelligent data interfaces. Our repository is available at https://github.com/OpenDCAI/Text2VectorSQL.
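
The recall degradation finding can be reproduced in miniature: applying a SQL-style predicate after a global top-k vector search can drop matching rows that a conventional pre-filtered search would keep. The toy 1-D "embeddings", rows, and predicate below are illustrative, not the benchmark's setup.

```python
def topk(rows, query, k):
    # Nearest neighbours by a toy 1-D distance (stand-in for vector search).
    return sorted(rows, key=lambda r: abs(r["emb"] - query))[:k]

def prefilter_search(rows, pred, query, k):
    # Conventional filtered vector search: apply the filter first,
    # then take the k nearest among the survivors.
    return topk([r for r in rows if pred(r)], query, k)

def postfilter_search(rows, pred, query, k):
    # SQL filter applied after a global top-k: matching rows outside the
    # global top-k are silently lost, degrading recall.
    return [r for r in topk(rows, query, k) if pred(r)]

# Ten hypothetical rows; "emb" is a 1-D embedding, "year" a SQL column.
rows = [{"id": i, "emb": float(i), "year": 2020 + (i % 2)} for i in range(10)]
pred = lambda r: r["year"] == 2021
pre = prefilter_search(rows, pred, query=0.0, k=3)    # ids 1, 3, 5
post = postfilter_search(rows, pred, query=0.0, k=3)  # only id 1 survives
```

Here the post-filter variant returns one row where the pre-filter variant returns three, mirroring the omission effect the authors report.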

[66] Distillation versus Contrastive Learning: How to Train Your Rerankers

Zhichao Xu, Zhiqi Huang, Shengyao Zhuang, Vivek Srikumar

Main category: cs.CL

TL;DR: Knowledge distillation generally outperforms contrastive learning for training cross-encoder rerankers when using a more performant teacher model, but contrastive learning remains effective when no superior teacher is available.

DetailsMotivation: To provide a clear comparison between contrastive learning and knowledge distillation strategies for training text rerankers under practical conditions, as both methods are widely used but their relative effectiveness hasn't been systematically compared.

Method: Empirically trained rerankers of different sizes (0.5B, 1.5B, 3B, 7B) and architectures (Transformer, Recurrent) using both contrastive learning and knowledge distillation on the same data, with a strong contrastive learning model serving as the distillation teacher.

Result: Knowledge distillation yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a more performant teacher model, consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity doesn’t provide the same advantage, especially for out-of-domain tasks.

Conclusion: Use knowledge distillation to train smaller rerankers when a larger, more performant teacher is available; otherwise, contrastive learning remains a robust baseline. The findings offer practical guidance for choosing training strategies based on available teacher models.

Abstract: Training effective text rerankers is crucial for information retrieval. Two strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied extensively, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes (0.5B, 1.5B, 3B, 7B) and architectures (Transformer, Recurrent) using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a more performant teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. We recommend using knowledge distillation to train smaller rerankers if a larger, more performant teacher is accessible; in its absence, contrastive learning remains a robust baseline. Our code implementation is made available to facilitate reproducibility.
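
Per query, the two training strategies reduce to two familiar objectives: a contrastive (InfoNCE-style) cross-entropy against the labeled positive, and a distillation KL between teacher and student score distributions over the same candidates. The sketch below writes out both; the scores are made-up numbers, and real training would operate on batched cross-encoder logits.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def contrastive_loss(student_scores, positive_idx):
    # InfoNCE-style: cross-entropy of the ground-truth positive
    # against the other candidates for the same query.
    probs = softmax(student_scores)
    return -math.log(probs[positive_idx])

def distillation_loss(student_scores, teacher_scores):
    # KL(teacher || student) over the candidate distribution,
    # transferring the teacher reranker's soft preferences.
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical scores for one query's three candidates (positive first).
student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
cl = contrastive_loss(student, positive_idx=0)
kd = distillation_loss(student, teacher)
```

The distillation loss vanishes when the student matches the teacher's distribution, whereas the contrastive loss only sees the single hard label, which is one way to read why a stronger teacher can carry extra signal.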

[67] XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification

Sachin Yadav, Dominik Schlechtweg

Main category: cs.CL

TL;DR: XL-DURel is a multilingual Sentence Transformer model optimized for ordinal Word-in-Context classification, using ranking objectives based on angular distance in complex space to outperform previous models.

DetailsMotivation: To improve Word-in-Context classification by treating binary WiC as a special case of ordinal WiC, enabling unified treatment across different task formulations.

Method: Fine-tuned multilingual Sentence Transformer with various loss functions for regression and ranking tasks, using angular distance in complex space for ranking objectives.

Result: Outperformed previous models on both ordinal and binary WiC data, showing that optimizing for ordinal tasks improves performance on binary tasks.

Conclusion: Binary WiC can be effectively treated as a special case of ordinal WiC, paving the way for unified WiC modeling across different task formulations.

Abstract: We propose XL-DURel, a finetuned, multilingual Sentence Transformer model optimized for ordinal Word-in-Context classification. We test several loss functions for regression and ranking tasks managing to outperform previous models on ordinal and binary data with a ranking objective based on angular distance in complex space. We further show that binary WiC can be treated as a special case of ordinal WiC and that optimizing models for the general ordinal task improves performance on the more specific binary task. This paves the way for a unified treatment of WiC modeling across different task formulations.
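
One plausible reading of a "ranking objective based on angular distance in complex space" is the normalized angle induced by the Hermitian inner product, which is invariant to global phase and scale; the paper's exact formulation may differ, so the sketch below is an assumption-laden illustration only.

```python
import math

def hermitian_dot(u, v):
    # Hermitian inner product <u, v> = sum_i u_i * conj(v_i).
    return sum(a * b.conjugate() for a, b in zip(u, v))

def angular_distance(u, v):
    # Angle between complex vectors via the modulus of the Hermitian
    # inner product, normalized to [0, 1] (0 = parallel, 1 = orthogonal).
    num = abs(hermitian_dot(u, v))
    den = math.sqrt(abs(hermitian_dot(u, u)) * abs(hermitian_dot(v, v)))
    return math.acos(min(1.0, num / den)) / (math.pi / 2)

u = [1 + 1j, 2 - 1j]
# Invariant to a global complex scale (phase + magnitude) of the vector.
same = angular_distance(u, [x * (0.5 + 0.5j) for x in u])
orth = angular_distance([1 + 0j, 0j], [0j, 1 + 0j])
```

A ranking objective would then prefer word-use pairs whose graded human similarity ordering matches the ordering of these distances.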

[68] DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns

Bernd J. Kröger

Main category: cs.CL

TL;DR: DYNARTmo is a dynamic articulatory model for visualizing speech articulation in 2D midsagittal plane, built on UK-DYNAMO framework with web-based implementation for phonetics education and speech therapy.

DetailsMotivation: To create a comprehensive articulatory model that can visualize speech processes for educational and therapeutic applications, addressing the need for accessible tools in phonetics and speech therapy.

Method: Integrates articulatory underspecification, segmental/gestural control, and coarticulation principles. Simulates six key articulators using ten continuous and six discrete control parameters to generate vocalic and consonantal configurations.

Result: Successfully developed a web-based application (SpeechArticulationTrainer) with sagittal, glottal, and palatal views for visualizing articulatory configurations.

Conclusion: DYNARTmo provides a functional static articulatory model suitable for educational purposes, with future work planned for dynamic movement generation and articulatory-acoustic integration.

Abstract: We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.

[69] NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System

Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Ajay Varghese Thomas, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

Main category: cs.CL

TL;DR: NyayaRAG is a RAG framework for Indian legal judgment prediction that combines case facts with relevant statutes and precedents, improving both prediction accuracy and explanation quality.

DetailsMotivation: Previous Indian LJP approaches overlooked key common law elements like statutory provisions and judicial precedents, focusing only on internal case content.

Method: Proposed NyayaRAG framework that simulates courtroom scenarios by providing factual case descriptions, relevant legal statutes, and semantically retrieved prior cases using a domain-specific pipeline for Indian law.

Result: Augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality, as evaluated using standard metrics and LLM-based evaluators like G-Eval.

Conclusion: Combining factual inputs with legal knowledge through RAG framework enhances legal judgment prediction performance and interpretability in common law systems.

Abstract: Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.

[70] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Chun-Chieh Liao, Fang-Ming Hung, Feng Liu

Main category: cs.CL

TL;DR: RPRO is a novel framework combining reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought performance in medical question answering, achieving superior results with smaller models.

DetailsMotivation: Existing LLMs generate reasoning chains lacking factual accuracy and clinical reliability in medical QA, requiring more reliable and clinically grounded approaches.

Method: RPRO combines reinforcement learning with preference-driven reasoning refinement, using task-adaptive reasoning templates, probabilistic evaluation aligned with clinical workflows, groupwise ranking optimization based on Bradley-Terry model, and KL-divergence regularization.

Result: Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines, with 1.1B parameter model outperforming much larger 7B-13B models including medical-specialized variants.

Conclusion: Combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.

Abstract: Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B-parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
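
A groupwise Bradley-Terry ranking objective is commonly implemented as a Plackett-Luce negative log-likelihood over an ordered group of candidates, with a KL term anchoring the policy to a reference model. The sketch below shows both pieces on made-up scores; RPRO's precise reward shaping and template machinery are described in the paper.

```python
import math

def groupwise_bt_nll(scores):
    # Plackett-Luce extension of Bradley-Terry: negative log-likelihood
    # of the observed quality ordering, with `scores` listed best-first.
    nll = 0.0
    for i in range(len(scores) - 1):
        denom = sum(math.exp(s) for s in scores[i:])
        nll -= scores[i] - math.log(denom)
    return nll

def kl_penalty(logp_policy, logp_ref):
    # KL-style regularizer keeping the policy close to a reference model.
    return sum(math.exp(lp) * (lp - lr) for lp, lr in zip(logp_policy, logp_ref))

# Hypothetical reward-model scores for three reasoning chains per question.
good_order = groupwise_bt_nll([2.0, 1.0, 0.0])  # scores agree with ranking
bad_order = groupwise_bt_nll([0.0, 1.0, 2.0])   # scores invert the ranking
```

Scores that already agree with the target ordering incur a lower NLL than the reversed ordering, which is the gradient signal driving the groupwise optimization.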

[71] Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions

Eve Fleisig, Matthias Orlikowski, Philipp Cimiano, Dan Klein

Main category: cs.CL

TL;DR: Annotator filtering methods designed for single ground-truth contexts often remove legitimate disagreement instead of spam, harming label diversity. Conservative removal (<5%) works best, as spammers tend to give fixed answers rather than random ones.

DetailsMotivation: To balance annotator reliability and representation in machine learning datasets, preserving label variation while filtering spam.

Method: Empirical evaluation of various heuristics for annotator filtering on subjective tasks, analyzing performance on synthetic spam and comparing how methods handle disagreement vs. actual spam.

Result: Most filtering methods remove annotators who disagree rather than spammers, increasing mean absolute error from true average labels. Spammers are often distributionally indistinguishable from real annotators and tend to give fixed answers, not random ones.

Conclusion: Tasks requiring variation preservation reverse the intuition of existing spam filtering methods: spammers are less random than non-spammers, highlighting the need for spam removal methods that account for label diversity.

Abstract: For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are more random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give relatively fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
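
The failure mode the paper describes is easy to reproduce: a disagreement-based filter, applied to subjective ratings containing no actual spam, removes the consistent dissenting annotator first and pulls the averages away from the true population mean. The annotators and ratings below are fabricated for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)

def filter_by_disagreement(annotators, frac_removed):
    # Heuristic under test: drop the annotators whose labels deviate most
    # from the per-item means; on subjective tasks this removes genuine
    # minority opinions rather than spam.
    item_means = [mean(col) for col in zip(*annotators.values())]
    def deviation(name):
        return mean([abs(l - m) for l, m in zip(annotators[name], item_means)])
    keep = sorted(annotators, key=deviation)
    n_keep = max(1, round(len(keep) * (1 - frac_removed)))
    return {name: annotators[name] for name in keep[:n_keep]}

# Fabricated 1-5 ratings: three similar raters plus one consistent dissenter.
annotators = {
    "a1": [4, 4, 5], "a2": [4, 5, 5], "a3": [5, 4, 4],
    "a4": [1, 2, 1],  # legitimate minority opinion, not a spammer
}
true_avg = [mean(col) for col in zip(*annotators.values())]
filtered = filter_by_disagreement(annotators, frac_removed=0.25)
filtered_avg = [mean(col) for col in zip(*filtered.values())]
# Removing the dissenter shifts the averages away from the true labels.
mae = mean([abs(f - t) for f, t in zip(filtered_avg, true_avg)])
```

Note that the dissenter is the least random annotator here, yet the deviation heuristic targets them first, consistent with the paper's finding.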

[72] Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications

Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen

Main category: cs.CL

TL;DR: First comprehensive evaluation of LLM memorization in medicine, showing it’s prevalent across adaptation scenarios and higher than general domain, with three types: beneficial, uninformative, and harmful memorization.

DetailsMotivation: To understand the extent of LLM memorization in medical applications, as memorization can impact diagnostic assistance, medical QA, and clinical information synthesis.

Method: Systematic analysis of three adaptation scenarios: continued pretraining on medical corpora, fine-tuning on standard medical benchmarks, and fine-tuning on real-world clinical data from Yale New Haven Health System.

Result: Memorization is prevalent across all scenarios and significantly higher than general domain. Three types identified: beneficial (clinical guidelines), uninformative (disclaimers), and harmful (sensitive patient data).

Conclusion: Practical recommendations provided to enhance beneficial memorization, minimize uninformative memorization, and mitigate harmful memorization to prevent patient data leakage.

Abstract: Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data. In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.

[73] Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?

Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman

Main category: cs.CL

TL;DR: Overview of the 4th Shared Task on Multilingual Coreference Resolution at CODI-CRAC 2025, featuring a new LLM track and expanded datasets across 17 languages.

DetailsMotivation: To advance multilingual coreference resolution by introducing LLM-friendly formats and expanding language coverage, while comparing traditional and LLM-based approaches.

Method: Organized shared task with traditional systems and new LLM track using simplified plaintext format; used CorefUD 1.3 with 22 datasets across 17 languages; evaluated 9 systems including 4 LLM-based approaches.

Result: Nine systems participated (4 LLM-based: 2 fine-tuned, 2 few-shot). Traditional systems maintained leadership, but LLMs demonstrated clear potential for future competition.

Conclusion: LLMs show promising potential in multilingual coreference resolution and may soon challenge established traditional approaches, indicating a shift in the field’s methodology.

Abstract: The paper presents an overview of the fourth edition of the Shared Task on Multilingual Coreference Resolution, organized as part of the CODI-CRAC 2025 workshop. As in the previous editions, participants were challenged to develop systems that identify mentions and cluster them according to identity coreference. A key innovation of this year’s task was the introduction of a dedicated Large Language Model (LLM) track, featuring a simplified plaintext format designed to be more suitable for LLMs than the original CoNLL-U representation. The task also expanded its coverage with three new datasets in two additional languages, using version 1.3 of CorefUD - a harmonized multilingual collection of 22 datasets in 17 languages. In total, nine systems participated, including four LLM-based approaches (two fine-tuned and two using few-shot adaptation). While traditional systems still kept the lead, LLMs showed clear potential, suggesting they may soon challenge established approaches in future editions.

[74] CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution

Milan Straka

Main category: cs.CL

TL;DR: CorPipe 25 is the winning system for CRAC 2025 Shared Task on Multilingual Coreference Resolution, achieving top performance in both LLM and unconstrained tracks by 8 percentage points.

DetailsMotivation: To participate in the CRAC 2025 Shared Task, which introduced a new LLM track, reduced computational requirements, and additional datasets, while migrating the system from TensorFlow to PyTorch.

Method: A complete reimplementation of the previous systems, migrating from TensorFlow to PyTorch.

Result: Significantly outperformed all other submissions in both LLM and unconstrained tracks by a substantial margin of 8 percentage points.

Conclusion: CorPipe 25 demonstrates superior performance in multilingual coreference resolution, with source code and trained models publicly available.

Abstract: We present CorPipe 25, the winning entry to the CRAC 2025 Shared Task on Multilingual Coreference Resolution. This fourth iteration of the shared task introduces a new LLM track alongside the original unconstrained track, features reduced development and test sets to lower computational requirements, and includes additional datasets. CorPipe 25 represents a complete reimplementation of our previous systems, migrating from TensorFlow to PyTorch. Our system significantly outperforms all other submissions in both the LLM and unconstrained tracks by a substantial margin of 8 percentage points. The source code and trained models are publicly available at https://github.com/ufal/crac2025-corpipe.

[75] Training Large Language Models To Reason In Parallel With Global Forking Tokens

Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan

Main category: cs.CL

TL;DR: SSFT introduces a set-based global loss in SFT to preserve diverse reasoning modes and generate emergent global forking tokens, improving performance on reasoning benchmarks.

DetailsMotivation: Common diversity strategies like temperature scaling worsen the trade-off between diversity and accuracy, as forking tokens that trigger diverse yet correct reasoning are deep in the sampling tree.

Method: Treat parallel reasoning as a set-of-next-token-prediction problem, incorporate set-based global loss into SFT using self-supervised bipartite matching between global forking tokens and unique reasoning traces.

Result: SSFT consistently outperforms SFT on multiple reasoning benchmarks under both Pass@1 and Cons@k metrics, preserving unique reasoning modes while naive fine-tuning collapses them.

Conclusion: Set Supervised Fine-Tuning effectively maintains diverse reasoning paths and produces emergent global forking tokens, improving parallel reasoning performance.

Abstract: Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem, and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching between our global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that our SSFT consistently outperforms SFT under both Pass@1 and Cons@k metrics.
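
The self-supervised bipartite matching step pairs each global forking token with the reasoning trace it fits best, minimizing total assignment cost. The sketch below uses brute-force search over permutations as a stand-in for the Hungarian algorithm (which a real implementation would use); the cost matrix is hypothetical.

```python
from itertools import permutations

def best_bipartite_matching(cost):
    # Exhaustive min-cost bipartite matching; fine for the handful of
    # global forking tokens in this sketch (O(n!) in general).
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

# Hypothetical matching costs: cost[i][j] is the loss of pairing forking
# token i with reasoning trace j (lower = better fit).
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.9, 0.3],
]
assignment, total = best_bipartite_matching(cost)
```

Under the matched assignment, each trace is supervised only through its paired forking token, which is how distinct reasoning modes avoid being averaged together.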

[76] Mathematics with large language models as provers and verifiers

Hieu Le Duc, Leo Liberti

Main category: cs.CL

TL;DR: ChatGPT (gpt-5) successfully solved 5/6 2025 IMO problems and proved about a third of 66 number theory conjectures using collaborative theorem proving with formal verification.

DetailsMotivation: To test and demonstrate the theorem-proving capabilities of large language models, particularly on challenging mathematical problems like IMO exercises and number theory conjectures.

Method: Used collaborative protocol with multiple gpt-5 instances (provers and verifiers) working together, with final proofs formally verified by Lean proof assistant and human-checked for premise-conclusion conformance.

Result: Successfully solved five out of six 2025 International Mathematical Olympiad problems and proved approximately one-third of 66 number theory conjectures from Cohen’s 2025 paper.

Conclusion: Large language models can achieve significant theorem-proving success on challenging mathematical problems when using collaborative protocols and formal verification, though the methodology is not yet complete or exact.

Abstract: During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove it. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the lean proof assistant, and the conformance of premises and conclusion of the lean code is verified by a human. Our methodology is by no means complete or exact. It was nonetheless able to solve five out of six 2025 IMO problems, and close about a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].

[77] META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine

Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bing Qin

Main category: cs.CL

TL;DR: A new method that combines multiple EBM principles (reliability, heterogeneity, and extrapolation analysis) to re-rank and filter medical evidence for LLMs, improving RAG performance in evidence-based medicine tasks by up to 11.4% accuracy.

DetailsMotivation: Current RAG applications in evidence-based medicine struggle to efficiently distinguish high-quality evidence, which is crucial for reducing misdiagnoses in clinical applications.

Method: Inspired by meta-analysis in EBM, the method employs reliability analysis, heterogeneity analysis, and extrapolation analysis to re-rank and filter medical evidence from PubMed dataset for LLMs.

Result: The method shows an accuracy improvement of up to 11.4% in experiments, successfully enabling RAG to extract higher-quality and more reliable evidence.

Conclusion: This approach reduces the infusion of incorrect knowledge into LLM responses and helps users receive more effective replies in evidence-based medicine applications.

Abstract: Evidence-based medicine (EBM) holds a crucial role in clinical application. With access to suitable medical articles, doctors can effectively reduce the incidence of misdiagnoses. Researchers find it efficient to use large language model (LLM) techniques like RAG for EBM tasks. However, EBM maintains stringent requirements for evidence, and RAG applications in EBM struggle to efficiently distinguish high-quality evidence. Therefore, inspired by the meta-analysis used in EBM, we provide a new method to re-rank and filter the medical evidence. This method presents multiple principles to select the best evidence for LLMs to use in diagnosis. We employ a combination of several EBM methods to emulate the meta-analysis, which includes reliability analysis, heterogeneity analysis, and extrapolation analysis. These processes allow the users to retrieve the best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments and results. Our method successfully enables RAG to extract higher-quality and more reliable evidence from the PubMed dataset. This work can reduce the infusion of incorrect knowledge into responses and help users receive more effective replies.
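
The three analyses could be combined into a single composite re-ranking score, rewarding reliability and extrapolability while penalizing heterogeneity. The weights, score fields, and articles below are illustrative guesses, not the paper's calibrated formulation.

```python
def meta_rank(articles, w_rel=0.5, w_het=0.3, w_ext=0.2):
    # Illustrative composite score: reward reliability and extrapolation,
    # penalize heterogeneity (disagreement with the rest of the evidence).
    def score(a):
        return (w_rel * a["reliability"]
                - w_het * a["heterogeneity"]
                + w_ext * a["extrapolation"])
    return sorted(articles, key=score, reverse=True)

# Fabricated evidence pool with per-analysis scores in [0, 1].
articles = [
    {"id": "low-quality", "reliability": 0.3, "heterogeneity": 0.8, "extrapolation": 0.4},
    {"id": "rct",         "reliability": 0.9, "heterogeneity": 0.2, "extrapolation": 0.7},
    {"id": "case-report", "reliability": 0.5, "heterogeneity": 0.5, "extrapolation": 0.5},
]
ranked = meta_rank(articles)
```

Only the top of such a ranking would be passed to the LLM as retrieved context, filtering out the low-quality evidence before generation.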

[78] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

Sriram Balasubramanian, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka

Main category: cs.CL

TL;DR: DecompTune improves attribution in long-document QA by teaching models to generate answer decompositions as reasoning steps, outperforming prior methods.

Motivation: Existing post-hoc attribution methods struggle with multi-hop, abstractive, and semi-extractive QA where answers synthesize information across passages.

Method: DecompTune: post-training method using SFT + GRPO pipeline to teach models to produce answer decompositions as intermediate reasoning steps, trained on curated dataset of complex QA tasks.

Result: Substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.

Conclusion: Reframing attribution as a reasoning problem through answer decomposition is effective for improving attribution reliability in complex QA settings.

Abstract: Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
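The reframing of attribution as decomposition-plus-matching can be illustrated with a minimal sketch; DecompTune trains a model to produce the decomposition, while the lexical-overlap scorer here is a simplified stand-in:

```python
# Illustrative sketch: split an answer into atomic claims, then attribute
# each claim to the passage with the highest lexical overlap. The overlap
# scorer is an assumption standing in for a trained attribution model.

def tokenize(text):
    return set(text.lower().replace(".", "").split())

def attribute(claims, passages):
    """Map each decomposed claim to the index of its best-supporting passage."""
    attributions = {}
    for claim in claims:
        ct = tokenize(claim)
        overlaps = [len(ct & tokenize(p)) for p in passages]
        attributions[claim] = max(range(len(passages)), key=overlaps.__getitem__)
    return attributions

passages = [
    "The Amazon river flows through Brazil and Peru.",
    "The Nile is the longest river in Africa.",
]
# In DecompTune these units are generated by the model as reasoning steps.
claims = ["The Amazon flows through Brazil", "The Nile is in Africa"]
print(attribute(claims, passages))
```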

[79] VISTA Score: Verification In Sequential Turn-based Assessment

Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White

Main category: cs.CL

TL;DR: VISTA is a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking, improving hallucination detection in multi-turn dialogues.

Motivation: Existing metrics for conversational AI factuality either evaluate isolated responses or treat unverifiable content as errors, limiting their effectiveness for multi-turn dialogue evaluation where hallucination remains a major obstacle.

Method: VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements into subjective, contradicted, lacking evidence, or abstaining categories.

Result: Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms improved annotator agreement.

Conclusion: By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.

Abstract: Hallucination–defined here as generating statements unsupported or contradicted by available evidence or conversational context–remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA’s decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
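A minimal turn-level scorer in the spirit of VISTA, using the paper's claim categories; the exact scoring formula is an assumption:

```python
# Each assistant turn is decomposed into claims and each claim labeled.
# Unverifiable-but-benign content (subjective, abstaining) is categorized
# rather than counted as error; the formula below is an assumption.

VERIFIABLE = {"supported", "contradicted", "lacking_evidence"}

def turn_score(labeled_claims):
    """Fraction of verifiable claims that are supported.

    Subjective and abstaining claims are excluded from the denominator,
    so opinions and refusals are not penalized as hallucinations.
    """
    verifiable = [l for l in labeled_claims if l in VERIFIABLE]
    if not verifiable:
        return 1.0  # nothing checkable, nothing hallucinated
    return verifiable.count("supported") / len(verifiable)

turns = [
    ["supported", "supported", "subjective"],
    ["supported", "contradicted", "lacking_evidence", "abstaining"],
]
scores = [turn_score(t) for t in turns]
print(scores)  # turn 1: 2/2 verifiable supported; turn 2: 1/3
```

Tracking these scores turn by turn is what makes factuality a sequential, dynamic property of the conversation rather than a per-response judgment.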

[80] OceanAI: A Conversational Platform for Accurate, Transparent, Near-Real-Time Oceanographic Insights

Bowen Chen, Jayesh Gajbhar, Gregory Dusek, Rob Redmon, Patrick Hogan, Paul Liu, DelWayne Bohnenstiehl, Dongkuan Xu, Ruoying He

Main category: cs.CL

TL;DR: OceanAI is a conversational AI platform that combines LLM fluency with real-time access to NOAA oceanographic data, providing verifiable responses and visualizations to prevent AI hallucinations in scientific contexts.

Motivation: To address the problem of AI hallucinations in scientific applications by creating a system that grounds responses in authoritative, real-time oceanographic data from NOAA.

Method: Integrates open-source LLMs with parameterized API calls to NOAA data streams, automatically identifying, parsing, and synthesizing relevant datasets into natural-language responses and data visualizations.

Result: In blind comparisons with three commercial AI chat products, only OceanAI produced NOAA-sourced values with original data references; others either declined to answer or provided unsupported results.

Conclusion: OceanAI advances transparency, reproducibility, and trust in AI for scientific applications by providing verifiable, data-grounded responses, offering a scalable framework for AI-enabled decision support in ocean sciences.

Abstract: Artificial intelligence is transforming the sciences, yet general conversational AI systems often generate unverified “hallucinations,” undermining scientific rigor. We present OceanAI, a conversational platform that integrates the natural-language fluency of open-source large language models (LLMs) with real-time, parameterized access to authoritative oceanographic data streams hosted by the National Oceanic and Atmospheric Administration (NOAA). Each query, such as “What was Boston Harbor’s highest water level in 2024?”, triggers real-time API calls that identify, parse, and synthesize relevant datasets into reproducible natural-language responses and data visualizations. In a blind comparison with three widely used AI chat-interface products, only OceanAI produced NOAA-sourced values with original data references; the others either declined to answer or provided unsupported results. Designed for extensibility, OceanAI connects to multiple NOAA data products and variables, supporting applications in marine hazard forecasting, ecosystem assessment, and water-quality monitoring. By grounding outputs in verifiable observations, OceanAI advances transparency, reproducibility, and trust, offering a scalable framework for AI-enabled decision support in the ocean sciences. A public demonstration is available at https://oceanai.ai4ocean.xyz.
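The grounding pattern, from query to parameterized data request to cited answer, can be sketched as follows; `fetch_station_series`, the station ID, and the values are hypothetical placeholders, not NOAA's actual API:

```python
# Minimal sketch of the grounding pattern OceanAI describes: a query is
# mapped to a parameterized data request, and the answer cites the source
# rather than relying on model memory. All names and values below are
# illustrative stand-ins.

def fetch_station_series(station, product, year):
    # Hypothetical: a real system would call a NOAA data service here.
    return {"source": f"noaa://{station}/{product}/{year}",
            "values": [1.2, 2.8, 3.4, 2.1]}

def answer_with_citation(station, product, year):
    data = fetch_station_series(station, product, year)
    peak = max(data["values"])
    return (f"Highest {product} at station {station} in {year}: "
            f"{peak} m (source: {data['source']})")

print(answer_with_citation("8443970", "water_level", 2024))
```

The key design choice is that the numeric answer and its citation come from the same fetched payload, so every value in the response is traceable to a data reference.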

[81] LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge

Heng Zhou, Ao Yu, Yuchen Fan, Jianing Shi, Li Kang, Hejia Geng, Yongting Zhang, Yutao Fan, Yuhao Wu, Tiancheng He, Yiran Qin, Lei Bai, Zhenfei Yin

Main category: cs.CL

TL;DR: LiveSearchBench is an automated pipeline for creating retrieval-dependent benchmarks from recent knowledge updates in Wikidata, addressing the limitations of static benchmarks that reward memorization rather than dynamic knowledge retrieval.

Motivation: Current LLM evaluation benchmarks are static and reward memorization, failing to capture the dynamic nature of world knowledge and the importance of retrieval in real-world question answering scenarios.

Method: Automated pipeline that computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three reasoning difficulty levels with unique, verifiable answers through SPARQL validation.

Result: Experiments show significant performance drop when models face facts post-dating pretraining, especially on multi-hop queries. Retrieval-augmented methods and larger models provide partial improvements but don’t fully close the recency gap.

Conclusion: LiveSearchBench shifts evaluation from static memorization toward tasks requiring up-to-date retrieval and reasoning, enabling systematic long-term assessment of LLMs under evolving knowledge.

Abstract: Evaluating large language models (LLMs) on question answering often relies on static benchmarks that reward memorization and understate the role of retrieval, failing to capture the dynamic nature of world knowledge. We present LiveSearchBench, an automated pipeline for constructing retrieval-dependent benchmarks from recent knowledge updates. Our method computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three levels of reasoning difficulty, each guaranteed to admit a unique, verifiable answer through SPARQL validation. The pipeline is fully automated, scalable across time, and minimizes human intervention, enabling continual regeneration of temporally grounded benchmarks. Experiments show a pronounced performance drop when models confront facts that post-date pretraining, with the gap most salient on multi-hop queries. Retrieval augmented methods and larger, instruction-tuned models provide partial gains but fail to close this recency gap. By design, LiveSearchBench shifts evaluation from static memorization toward tasks that require up-to-date retrieval and reasoning, offering a foundation for systematic, long-term assessment of LLMs under evolving knowledge.
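The delta step between snapshots reduces to a set difference over (subject, predicate, object) triples; real Wikidata snapshots are large dumps, so small in-memory sets stand in here:

```python
# Sketch of the snapshot-delta step: candidate facts for new questions
# are triples present in the new snapshot but not the old one.

def snapshot_delta(old_triples, new_triples):
    """Return (added, removed) (s, p, o) triples between two snapshots."""
    added = new_triples - old_triples
    removed = old_triples - new_triples
    return added, removed

old = {("Q1", "P35", "Q100"), ("Q2", "P17", "Q30")}
new = {("Q1", "P35", "Q200"), ("Q2", "P17", "Q30")}
added, removed = snapshot_delta(old, new)
print(added)    # a changed head-of-state fact: basis for a fresh question
print(removed)
```

Each added triple would then be quality-filtered and turned into a natural-language question whose unique answer is validated via SPARQL against the new snapshot.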

[82] Control Barrier Function for Aligning Large Language Models

Yuya Miyaoka, Masaki Inoue

Main category: cs.CL

TL;DR: Control-based framework using control barrier functions (CBF) to align LLMs without fine-tuning, ensuring user-desirable text generation through safety filtering.

Motivation: To ensure large language models generate user-desirable text through a safety intervention mechanism that doesn't require model fine-tuning.

Method: Apply control barrier function (CBF) safety filter to predicted tokens from baseline LLM, creating an add-on intervention system that can directly use evaluation models for filter design.

Result: Implemented text-generation system with open-source language models that successfully generates positive text through the CBF safety filtering approach.

Conclusion: The CBF-based safety filter provides an effective add-on solution for LLM alignment that works without model fine-tuning and can leverage existing evaluation models.

Abstract: This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the CBF safety filter to the predicted token generated by the baseline LLM in order to intervene in the generated text. The safety filter offers two significant advantages: it is an add-on, allowing it to be used for alignment purposes without fine-tuning the baseline LLM, and if an evaluation model for the desired alignment exists, it can be directly applied to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.
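A conceptual sketch of a CBF-style filter on next-token logits, using the standard discrete barrier decrease condition h(x') >= (1 - alpha) h(x); the candidate logits, safety scores, and alpha value are illustrative assumptions:

```python
# h(x) >= 0 encodes "text so far is desirable". A candidate token is
# admissible if appending it keeps the CBF decrease condition; the
# filter zeroes out inadmissible tokens and renormalizes, leaving the
# baseline LLM untouched (an add-on intervention).
import math

def cbf_filter(logits, h_current, h_next, alpha=0.5):
    """Mask tokens violating h(x') >= (1 - alpha) * h(x), then softmax."""
    threshold = (1 - alpha) * h_current
    masked = [l if h >= threshold else float("-inf")
              for l, h in zip(logits, h_next)]
    m = max(masked)
    exps = [math.exp(l - m) if l != float("-inf") else 0.0 for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]   # baseline LLM scores for 3 candidate tokens
h_next = [0.8, -0.2, 0.6]  # evaluation model's safety value after each token
probs = cbf_filter(logits, h_current=0.7, h_next=h_next, alpha=0.5)
print(probs)  # token 1 is blocked; probability mass shifts to tokens 0 and 2
```

A full implementation would need a fallback when every candidate is inadmissible (here z would be zero).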

[83] CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre

Main category: cs.CL

TL;DR: CareMedEval is a new dataset for evaluating LLMs on biomedical critical appraisal, derived from French medical exams with 534 questions from 37 scientific articles. Current LLMs struggle with this task, especially on study limitations and statistical analysis questions.

Motivation: Critical appraisal of scientific literature is essential in biomedicine, but LLMs have limited reliability for critical reasoning in specialized domains. There's a need for better evaluation benchmarks for grounded biomedical reasoning.

Method: Created CareMedEval dataset from authentic French medical student exams containing 534 questions based on 37 scientific articles. Benchmarked state-of-the-art generalist and biomedical-specialized LLMs under various context conditions.

Result: Both open and commercial LLMs failed to exceed an Exact Match Rate of 0.5. Generating intermediate reasoning tokens improved results, but models remained challenged on questions about study limitations and statistical analysis.

Conclusion: CareMedEval provides a challenging benchmark that exposes current LLM limitations in grounded biomedical reasoning and paves the way for future development of automated support for critical appraisal.

Abstract: Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.

cs.CV

[84] LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices

Hyunseok Kwak, Kyeongwon Lee, Jae-Jin Lee, Woojoo Lee

Main category: cs.CV

TL;DR: LoRA-Edge enables parameter-efficient fine-tuning of CNNs for edge applications like HAR by using tensor-train decomposition and selective core updates, achieving near-full fine-tuning accuracy with 1-2 orders of magnitude fewer parameters.

Motivation: Full fine-tuning of CNNs is infeasible under strict memory, compute, and energy budgets for edge applications like Human Activity Recognition (HAR) that need to withstand domain shift.

Method: Applies Tensor-Train SVD to pre-trained convolutional layers, selectively updates only the output-side core with zero-initialization, and fuses updates back into dense kernels without changing inference cost.

Result: Achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, with 1.4-3.8x faster convergence on Jetson Orin Nano across diverse HAR datasets and CNN backbones.

Conclusion: LoRA-Edge makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.

Abstract: On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) selectively updates only the output-side core with zero-initialization to keep the auxiliary path inactive at the start, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4-3.8x faster convergence to target F1. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
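A single-SVD simplification of the idea (the real method applies a full TT-SVD producing multiple cores): factor the kernel, train only a zero-initialized delta on the output-side factor, and fuse back into a dense kernel so inference cost is unchanged:

```python
# NumPy sketch: factor a flattened conv kernel, adapt only a
# zero-initialized delta on the output-side factor, then fuse. The
# shapes, rank, and "gradient step" below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, k, rank = 8, 4, 3, 2

W = rng.standard_normal((c_out, c_in * k * k))   # flattened conv kernel
U, S, Vt = np.linalg.svd(W, full_matrices=False)
G_out = U[:, :rank]                   # output-side core (the one adapted)
G_in = S[:rank, None] * Vt[:rank]     # frozen input-side factor

delta = np.zeros_like(G_out)  # zero-init keeps the auxiliary path inactive
print("trainable params:", delta.size, "of", W.size)  # 16 of 288

# Stand-in for fine-tuning: only `delta` would receive gradient updates.
delta += 0.01 * rng.standard_normal(delta.shape)

W_fused = (G_out + delta) @ G_in      # fuse back: dense kernel, same cost
print(W_fused.shape)                  # (8, 36)
```

The parameter count illustrates the claim in the abstract: the trainable delta is two orders of magnitude smaller than the dense kernel.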

[85] SILVI: Simple Interface for Labeling Video Interactions

Ozan Kanbertay, Richard Vogg, Elif Karakoc, Peter M. Kappeler, Claudia Fichtel, Alexander S. Ecker

Main category: cs.CV

TL;DR: SILVI is an open-source labeling software that bridges the gap between behavioral labeling and individual localization for analyzing animal interactions in video data, enabling structured annotations for computer vision model training.

Motivation: Existing annotation tools either support behavioral labeling without localization or localization without interaction capture, creating a gap for analyzing social and individualized animal behavior through computer vision methods.

Method: Developed SILVI - an open-source labeling software that integrates both behavioral annotation and individual localization functionalities, allowing researchers to annotate behaviors and interactions directly within video data.

Result: SILVI generates structured outputs suitable for training and validating computer vision models, facilitating the development of automated approaches for fine-grained behavioral analyses.

Conclusion: SILVI effectively links behavioral ecology with computer vision, providing a tool that could be broadly useful for annotating interactions in various video contexts beyond animal behavior, including human interactions requiring dynamic scene graph extraction.

Abstract: Computer vision methods are increasingly used for the automated analysis of large volumes of video data collected through camera traps, drones, or direct observations of animals in the wild. While recent advances have focused primarily on detecting individual actions, much less work has addressed the detection and annotation of interactions – a crucial aspect for understanding social and individualized animal behavior. Existing open-source annotation tools support either behavioral labeling without localization of individuals, or localization without the capacity to capture interactions. To bridge this gap, we present SILVI, an open-source labeling software that integrates both functionalities. SILVI enables researchers to annotate behaviors and interactions directly within video data, generating structured outputs suitable for training and validating computer vision models. By linking behavioral ecology with computer vision, SILVI facilitates the development of automated approaches for fine-grained behavioral analyses. Although developed primarily in the context of animal behavior, SILVI could be useful more broadly to annotate human interactions in other videos that require extracting dynamic scene graphs. The software, along with documentation and download instructions, is available at: https://gitlab.gwdg.de/kanbertay/interaction-labelling-app.
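A hypothetical record showing the two capabilities SILVI combines, per-individual localization plus an interaction label linking individuals; the schema below is an assumption for illustration, not SILVI's actual output format:

```python
# Illustrative structured annotation: bounding boxes localize the two
# individuals over a frame range, and the interaction field ties them
# together with roles and a behavior label. Field names are assumptions.
import json

annotation = {
    "video": "lemurs_cam03.mp4",
    "frames": [1520, 1604],                       # interaction start/end
    "individuals": [
        {"id": "A", "bbox": [110, 85, 60, 90]},   # x, y, w, h
        {"id": "B", "bbox": [190, 92, 55, 88]},
    ],
    "interaction": {"actor": "A", "recipient": "B", "behavior": "grooming"},
}
print(json.dumps(annotation, indent=2))
```

Records of this shape map directly onto dynamic scene-graph edges (actor, behavior, recipient), which is why the tool generalizes beyond animal behavior.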

[86] PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Yicheng Xiao, Yu Chen, Haoxuan Ma, Jiale Hong, Caorui Li, Lingxiang Wu, Haiyun Guo, Jinqiao Wang

Main category: cs.CV

TL;DR: PixCLIP enhances CLIP’s fine-grained vision-language alignment by processing visual prompts and long textual descriptions simultaneously, achieving state-of-the-art performance through pixel-text alignment.

Motivation: Improve CLIP's fine-grained image-text alignment by overcoming its text encoder's token length limitation and synergistically enhancing both visual and textual content processing granularity.

Method: Propose PixCLIP framework with automated annotation pipeline for pixel-level localized long-form descriptions, construct LongGRIT dataset, replace CLIP’s text encoder with LLM, and implement three-branch pixel-text alignment learning.

Result: PixCLIP achieves breakthroughs in pixel-level interaction and long-form text handling, demonstrating state-of-the-art performance in fine-grained vision-language alignment tasks.

Conclusion: The proposed PixCLIP framework successfully addresses CLIP’s limitations by enabling concurrent processing of visual prompts and lengthy textual descriptions, establishing new state-of-the-art in fine-grained vision-language alignment.

Abstract: While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model’s fine-grained vision-language alignment. However, the inherent token length limitation of CLIP’s text encoder fundamentally restricts CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Secondly, we replace CLIP’s original text encoder with the LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.
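The underlying region-text alignment objective can be sketched as a symmetric contrastive (InfoNCE) loss over matched region and description embeddings; this is only the basic CLIP-style mechanism, with PixCLIP's three-branch framework assumed to apply such alignment at multiple granularities:

```python
# Symmetric InfoNCE over matched (region, description) embedding pairs:
# row i of each matrix is a matched pair, all other rows are negatives.
import numpy as np

def info_nce(region_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; lower means better alignment."""
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature
    labels = np.arange(len(r))

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)       # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()           # diagonal = matches

    return (ce(logits) + ce(logits.T)) / 2            # region→text + text→region

rng = np.random.default_rng(1)
regions = rng.standard_normal((4, 16))
texts = regions + 0.1 * rng.standard_normal((4, 16))  # near-matched pairs
print(float(info_nce(regions, texts)))                # low loss: pairs align
```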

[87] Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets

Duong Mai, Lawrence Hall

Main category: cs.CV

TL;DR: Using noise injection techniques during training improves COVID-19 detection from chest X-rays by reducing performance gap between in-distribution and out-of-distribution data.

Motivation: Deep learning models for COVID-19 detection from chest X-rays fail to generalize to new clinical sources due to learning source-specific shortcuts rather than actual biomarkers.

Method: Applied fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during model training to improve robustness to distribution shifts.

Result: Reduced performance gap between in-distribution and out-of-distribution evaluation from 0.10-0.20 to 0.01-0.06 across key metrics including AUC, F1, accuracy, recall and specificity.

Conclusion: Noise injection during training is an effective technique to improve generalization of COVID-19 detection models across different clinical sources and devices.

Abstract: Deep learning (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts - source-specific artifacts that do not translate to new distributions - rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. To render the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall, and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood
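The four noise types named in the abstract can be implemented as simple NumPy augmentations on a normalized [0, 1] image; the parameter values below are illustrative, not the paper's settings:

```python
# Training-time noise injection for a normalized grayscale image.
# Parameter values (sigma, amount) are illustrative assumptions.
import numpy as np

def gaussian(img, rng, sigma=0.05):
    return np.clip(img + rng.normal(0, sigma, img.shape), 0, 1)

def speckle(img, rng, sigma=0.05):
    # multiplicative noise: pixel intensity scales with its own value
    return np.clip(img * (1 + rng.normal(0, sigma, img.shape)), 0, 1)

def poisson(img, rng, scale=255.0):
    # shot noise: treat intensities as photon counts
    return np.clip(rng.poisson(img * scale) / scale, 0, 1)

def salt_and_pepper(img, rng, amount=0.02):
    out = img.copy()
    mask = rng.random(img.shape)
    out[mask < amount / 2] = 0.0                        # pepper
    out[(mask >= amount / 2) & (mask < amount)] = 1.0   # salt
    return out

rng = np.random.default_rng(42)
cxr = rng.random((64, 64))  # stand-in for a normalized chest X-ray
for fn in (gaussian, speckle, poisson, salt_and_pepper):
    noisy = fn(cxr, rng)
    print(fn.__name__, noisy.shape)
```

Applying one of these transforms per training sample is the data-side intervention the study credits with shrinking the ID/OOD gap.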

[88] Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

Main category: cs.CV

TL;DR: Imitation learning policies for X-ray-guided spine cannula insertion were developed and tested in simulation, achieving 68.5% first-attempt success with safe trajectories across diverse anatomy, showing promise for CT-free robotic spinal navigation.

Motivation: To examine whether imitation learning applies to complex X-ray-guided procedures like spine instrumentation, where multi-view X-ray interpretation is challenging, specifically for bi-plane-guided cannula insertion.

Method: Created an in silico sandbox for realistic simulation of X-ray-guided spine procedures, curated dataset of correct trajectories and bi-planar X-ray sequences, trained imitation learning policies for planning and open-loop control using visual information only.

Result: Policy succeeded on first attempt in 68.5% of cases, maintained safe intra-pedicular trajectories across diverse vertebral levels, generalized to complex anatomy including fractures, remained robust to varied initializations, and produced plausible trajectories on real bi-planar X-rays despite simulation-only training.

Conclusion: While promising with good generalization and robustness, limitations exist in entry point precision. Full closed-loop control requires more frequent feedback, but with robust priors and domain knowledge, such models could enable lightweight CT-free robotic spinal navigation.

Abstract: Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation. This is because interpretation of multi-view X-rays is complex. We examine opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy generalized to complex anatomy, including fractures, and remained robust to varied initializations. Rollouts on real bi-planar X-rays further suggest that the model can produce plausible trajectories, despite training exclusively in simulation. While these preliminary results are promising, we also identify limitations, especially in entry point precision. Full closed-loop control will require additional considerations around how to provide sufficiently frequent feedback. With more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

[89] Desert Waste Detection and Classification Using Data-Based and Model-Based Enhanced YOLOv12 DL Model

Abdulmumin Sa’ad, Sulaimon Oyeniyi Adebayo, Abdul Jabbar Siddiqui

Main category: cs.CV

TL;DR: Enhanced lightweight YOLOv12 with Self-Adversarial Training and specialized data augmentation achieves real-time waste detection in desert environments with improved accuracy and efficiency.

Motivation: Address the global waste crisis, particularly in remote/harsh environments like deserts where traditional collection is inefficient and hazardous, and fill the gap in automated waste detection research that overlooks organic/hazardous waste and underexplored terrains.

Method: Proposed an enhanced real-time object detection framework using pruned lightweight YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies, evaluated on DroneTrashNet dataset.

Result: Significant improvements in precision, recall, and mAP while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones; outperformed state-of-the-art lightweight YOLO variants.

Conclusion: Combining data-centric and model-centric enhancements provides an effective solution for robust, real-time waste detection in challenging desert environments.

Abstract: The global waste crisis is escalating, with solid waste generation expected to increase by 70% by 2050. Traditional waste collection methods, particularly in remote or harsh environments like deserts, are labor-intensive, inefficient, and often hazardous. Recent advances in computer vision and deep learning have opened the door to automated waste detection systems, yet most research focuses on urban environments and recyclable materials, overlooking organic and hazardous waste and underexplored terrains such as deserts. In this work, we propose an enhanced real-time object detection framework based on a pruned, lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies. Using the DroneTrashNet dataset, we demonstrate significant improvements in precision, recall, and mean average precision (mAP), while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones. Benchmarking our model against state-of-the-art lightweight YOLO variants further highlights its optimal balance of accuracy and efficiency. Our results validate the effectiveness of combining data-centric and model-centric enhancements for robust, real-time waste detection in desert environments.

[90] Improving Diagnostic Performance on Small and Imbalanced Datasets Using Class-Based Input Image Composition

Hlali Azzeddine, Majid Ben Yakhlef, Soulaiman El Hazzat

Main category: cs.CV

TL;DR: Class-Based Image Composition creates composite images from multiple same-class samples to address class imbalance and small datasets, achieving near-perfect accuracy (99.6%) on OCT medical imaging data.

Motivation: Small, imbalanced datasets and poor image quality lead to high false prediction rates in deep learning models, particularly in medical imaging where subtle disease patterns need to be distinguished.

Method: Proposes Composite Input Images (CoImg) - fusing multiple images of the same class into combined visual composites arranged in 3x1 layouts, creating a class-balanced dataset (Co-OCTDL) from the original imbalanced OCT dataset.

Result: Achieved near-perfect performance: 99.6% accuracy, 0.995 F1-score, 0.9996 AUC, with significantly reduced false prediction rates compared to baseline model trained on raw dataset.

Conclusion: The method effectively addresses class imbalance and small sample size issues in medical imaging, producing high-quality predictions even for weak datasets through enhanced intra-class variance and information density.

Abstract: Small, imbalanced datasets and poor input image quality can lead to high false prediction rates with deep learning models. This paper introduces Class-Based Image Composition, an approach that allows us to reformulate training inputs through a fusion of multiple images of the same class into combined visual composites, named Composite Input Images (CoImg). This enhances the intra-class variance, improves the valuable information density per training sample, and increases the ability of the model to distinguish between subtle disease patterns. Our method was evaluated on the Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et al., 2024), which contains 2,064 high-resolution optical coherence tomography (OCT) scans of the human retina, representing seven distinct diseases with a significant class imbalance. We constructed a perfectly class-balanced version of this dataset, named Co-OCTDL, where each scan is presented as a 3x1 layout composite image. To assess the effectiveness of this new representation, we conducted a comparative analysis between the original dataset and its variant using a VGG16 model. A fair comparison was ensured by utilizing the identical model architecture and hyperparameters for all experiments. The proposed approach markedly improved diagnostic results. The enhanced dataset achieved near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared to a baseline model trained on the raw dataset. The false prediction rate was also significantly lower, demonstrating that the method can produce high-quality predictions even for weak datasets affected by class imbalance or small sample size.
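The CoImg construction described above can be sketched as follows. This is a minimal illustration assuming the 3x1 layout is a simple vertical concatenation of three same-class images and that the balanced set is built by sampling with replacement per class; the paper's exact fusion and sampling scheme may differ.

```python
import numpy as np

def make_composite(images, layout=(3, 1)):
    """Fuse same-class images into one composite in a rows x cols grid.

    `images` is a list of rows*cols HxWxC arrays of identical shape; the
    3x1 grid follows the Co-OCTDL description, read here as 3 rows, 1 column.
    """
    rows, cols = layout
    assert len(images) == rows * cols, "need exactly rows*cols images"
    grid = [np.concatenate(images[r * cols:(r + 1) * cols], axis=1)
            for r in range(rows)]
    return np.concatenate(grid, axis=0)

def balance_with_composites(class_to_images, n_per_class, rng):
    """Build a class-balanced dataset of composites by sampling three
    same-class images (with replacement) per composite."""
    dataset = []
    for label, imgs in class_to_images.items():
        for _ in range(n_per_class):
            picks = [imgs[i] for i in rng.integers(0, len(imgs), size=3)]
            dataset.append((make_composite(picks), label))
    return dataset
```

Because each composite mixes three independently drawn scans, every training sample carries more intra-class variation than a single image would.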

[91] Deep learning-based object detection of offshore platforms on Sentinel-1 Imagery and the impact of synthetic training data

Robin Spanier, Thorsten Hoeser, Claudia Kuenzer

Main category: cs.CV

TL;DR: Using synthetic and real Sentinel-1 satellite data with YOLOv10 models improves offshore infrastructure detection, achieving F1 score of 0.90 and demonstrating geographic transferability across unseen regions.

DetailsMotivation: Need for effective monitoring systems for expanding marine infrastructure, addressing data scarcity and class imbalance in underrepresented object classes.

Method: Trained YOLOv10 object detection models with synthetic and real Sentinel-1 satellite imagery from four regions, evaluated on three unseen regions using region-holdout approach.

Result: Detected 3,529 offshore platforms across three regions, achieved F1 score of 0.85 (improved to 0.90 with synthetic data), demonstrated successful geographic transferability.

Conclusion: Synthetic data generation effectively addresses dataset imbalance and improves model performance, enabling globally transferable offshore infrastructure detection using deep learning.

Abstract: The recent and ongoing expansion of marine infrastructure, including offshore wind farms, oil and gas platforms, artificial islands, and aquaculture facilities, highlights the need for effective monitoring systems. The development of robust models for offshore infrastructure detection relies on comprehensive, balanced datasets, but falls short when samples are scarce, particularly for underrepresented object classes, shapes, and sizes. By training deep learning-based YOLOv10 object detection models with a combination of synthetic and real Sentinel-1 satellite imagery acquired in the fourth quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of Guinea, and Coast of Brazil), this study investigates the use of synthetic training data to enhance model performance. We evaluated this approach by applying the model to detect offshore platforms in three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) and thereby assess geographic transferability. This region-holdout evaluation demonstrated that the model generalises beyond the training areas. In total, 3,529 offshore platforms were detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which improved to 0.90 upon incorporating synthetic data. We analysed how synthetic data enhances the representation of unbalanced classes and overall model performance, taking a first step toward globally transferable detection of offshore infrastructure. This study underscores the importance of balanced datasets and highlights synthetic data generation as an effective strategy to address common challenges in remote sensing, demonstrating the potential of deep learning for scalable, global offshore infrastructure monitoring.

[92] I Detect What I Don’t Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging

Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC Santosh

Main category: cs.CV

TL;DR: Unsupervised anomaly detection framework for medical imaging that incrementally expands normal samples without labels using lightweight adapters and uncertainty-gated admission.

DetailsMotivation: Addresses the challenge of unknown anomaly detection in medical imaging where labeled anomalies are scarce and expert supervision is costly.

Method: Uses frozen pretrained vision backbone with tiny convolutional adapters, k-NN anomaly scoring with coreset storage, and dual probabilistic gates (z-score threshold and SWAG-based epistemic uncertainty) for safe sample admission.

Result: Substantial performance gains: COVID-CXR ROC-AUC from 0.9489 to 0.9982, Pneumonia CXR ROC-AUC from 0.6834 to 0.8968, Brain MRI ND-5 ROC-AUC from 0.6041 to 0.7269.

Conclusion: The framework is effective and efficient for real-world, label-scarce medical imaging applications, steadily refining normality as unlabeled data arrive.

Abstract: Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert supervision. We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. Starting from a small, verified seed of normal images, our method alternates between lightweight adapter updates and uncertainty-gated sample admission. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. Extracted embeddings are stored in a compact coreset, enabling efficient k-nearest neighbor (k-NN) anomaly scoring. Safety during incremental expansion is enforced by dual probabilistic gates: a sample is admitted into the normal memory only if its distance to the existing coreset lies within a calibrated z-score threshold, and its SWAG-based epistemic uncertainty remains below a seed-calibrated bound. This mechanism prevents drift and false inclusions without relying on generative reconstruction or replay buffers. Empirically, our system steadily refines the notion of normality as unlabeled data arrive, producing substantial gains over baselines. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. These results highlight the effectiveness and efficiency of the proposed framework for real-world, label-scarce medical imaging applications.
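The coreset scoring and z-score admission gate can be sketched as below. This is a simplified stand-in: the z-threshold is calibrated here by leave-one-out scores over the current coreset, and the paper's second gate (SWAG epistemic uncertainty) is omitted.

```python
import numpy as np

def knn_score(x, coreset, k=5):
    """Anomaly score: mean distance from embedding x to its k nearest
    coreset neighbours (smaller = more normal)."""
    d = np.linalg.norm(coreset - x, axis=1)
    return np.sort(d)[:k].mean()

def admit(x, coreset, z_max=2.0, k=5):
    """Admit x into the normal memory only if its k-NN score lies within a
    z-score threshold calibrated on the coreset's own leave-one-out scores.
    The SWAG uncertainty gate from the paper is not reproduced here."""
    scores = np.array([knn_score(c, np.delete(coreset, i, axis=0), k)
                       for i, c in enumerate(coreset)])
    mu, sigma = scores.mean(), scores.std() + 1e-8
    z = (knn_score(x, coreset, k) - mu) / sigma
    return bool(z <= z_max)
```

A sample far from the coreset gets a large z-score and is rejected, which is what prevents anomalous images from drifting into the normal memory.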

[93] Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.CV

TL;DR: The paper introduces two methods for temporal action localization: Boundary Distance Regression (BDR) for precise boundary detection and Adaptive Temporal Refinement (ATR) for efficient computation allocation, achieving significant performance improvements with reduced computational cost.

DetailsMotivation: Current temporal action localization methods apply uniform computation despite significant variations in boundary detection difficulty, leading to inefficient resource allocation and suboptimal performance.

Method: 1) Boundary Distance Regression (BDR): Uses signed-distance regression instead of classification for information-theoretically optimal localization. 2) Adaptive Temporal Refinement (ATR): Allocates computation via continuous depth selection with differentiable optimization, avoiding reinforcement learning.

Result: BDR achieves 43% sharper boundary peaks and consistent 1.8-3.1% mAP@0.7 improvements across architectures. ATR achieves 56.5% mAP@0.7 at 162G FLOPs (vs 53.6% at 198G), providing 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity (4.2% on short actions).

Conclusion: The proposed methods enable more precise boundary detection and efficient computation allocation, achieving state-of-the-art performance with reduced computational requirements across multiple benchmarks.

Abstract: Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection $\tau \in [0,1]$, enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% mAP@0.7 at 162G FLOPs, compared to 53.6% at 198G for uniform processing, providing a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
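The signed-distance targets that BDR regresses can be sketched as follows; the sign convention (negative before a boundary, positive after) is an assumption, since the abstract only specifies signed-distance regression.

```python
import numpy as np

def signed_distance_targets(num_frames, boundaries):
    """Per-frame regression target: signed offset to the nearest action
    boundary. A boundary is localized where the target crosses zero,
    rather than by thresholding a classification heatmap."""
    t = np.arange(num_frames)[:, None]   # (T, 1) frame indices
    b = np.asarray(boundaries)[None, :]  # (1, B) boundary positions
    diff = t - b                         # signed offset to each boundary
    idx = np.abs(diff).argmin(axis=1)    # nearest boundary per frame
    return diff[np.arange(num_frames), idx].astype(float)
```

Unlike a per-frame "boundary / not boundary" label, this target changes smoothly and its zero-crossing pins the boundary to sub-bin precision, which is consistent with the sharper boundary peaks the paper reports.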

[94] Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization

Zhejia Cai, Puhua Jiang, Shiwei Mao, Hongkun Cao, Ruqi Huang

Main category: cs.CV

TL;DR: Unified framework for joint optimization of mesh geometry and appearance using Gaussian-guided differentiable rendering, enabling high-quality 3D reconstruction for editing tasks.

DetailsMotivation: Existing methods decouple geometry and appearance optimization, prioritizing either geometric accuracy or photorealistic rendering, which hinders downstream editing tasks.

Method: Simultaneously optimizes mesh geometry (vertex positions and faces) and vertex colors via Gaussian-guided mesh differentiable rendering, leveraging photometric consistency and geometric regularization from normal and depth maps.

Result: High-quality 3D reconstruction that can be exploited in downstream editing tasks such as relighting and shape deformation.

Conclusion: Advocates unified treatment of geometry and appearance optimization for seamless Gaussian-mesh joint optimization to enable better 3D editing capabilities.

Abstract: Reconstructing real-world objects from multi-view images is essential for applications in 3D editing, AR/VR, and digital content creation. Existing methods typically prioritize either geometric accuracy (Multi-View Stereo) or photorealistic rendering (Novel View Synthesis), often decoupling geometry and appearance optimization, which hinders downstream editing tasks. This paper advocates a unified treatment of geometry and appearance optimization for seamless Gaussian-mesh joint optimization. More specifically, we propose a novel framework that simultaneously optimizes mesh geometry (vertex positions and faces) and vertex colors via Gaussian-guided mesh differentiable rendering, leveraging photometric consistency from input images and geometric regularization from normal and depth maps. The obtained high-quality 3D reconstruction can be further exploited in downstream editing tasks, such as relighting and shape deformation. The code will be publicly available upon acceptance.

[95] A Linear Fractional Transformation Model and Calibration Method for Light Field Camera

Zhong Chen, Changfeng Chen

Main category: cs.CV

TL;DR: Proposes a linear fractional transformation parameter to decouple main lens and micro lens array for light field camera calibration, with analytical solution and nonlinear refinement.

DetailsMotivation: Accurate calibration of internal parameters is crucial but challenging for 3D reconstruction using light field cameras.

Method: Uses linear fractional transformation parameter α to decouple main lens and micro lens array, with analytical least squares solution followed by nonlinear refinement, plus feature detection from raw images.

Result: Experimental results on both physical and simulated data verify the method’s performance, and the model enables faster simulation of raw light field images.

Conclusion: The proposed calibration method is effective and enables faster light field image simulation, which benefits data-driven deep learning approaches.

Abstract: Accurate calibration of internal parameters is a crucial yet challenging prerequisite for 3D reconstruction using light field cameras. In this paper, we propose a linear fractional transformation (LFT) parameter $\alpha$ to decouple the main lens and micro lens array (MLA). The proposed method includes an analytical solution based on least squares, followed by nonlinear refinement. The method for detecting features from the raw images is also introduced. Experimental results on both physical and simulated data have verified the performance of the proposed method. Based on the proposed model, the simulation of raw light field images becomes faster, which is crucial for data-driven deep learning methods. The corresponding code can be obtained from the author’s website.
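A generic linear fractional transformation and its least-squares fit can be sketched as below. This illustrates the LFT form and the analytical step only; the paper's specific $\alpha$ parameterisation of the main-lens/MLA coupling and its nonlinear refinement are not reproduced.

```python
import numpy as np

def fit_lft(x, y):
    """Fit y = (a*x + b) / (c*x + d) by homogeneous linear least squares.

    Rearranging gives a*x + b - c*x*y - d*y = 0, linear in (a, b, c, d),
    solved via SVD (solution recovered up to scale).
    """
    A = np.stack([x, np.ones_like(x), -x * y, -y], axis=1)
    _, _, vt = np.linalg.svd(A)
    return vt[-1]  # (a, b, c, d)

def apply_lft(p, x):
    a, b, c, d = p
    return (a * x + b) / (c * x + d)
```

The closed-form fit gives the initial estimate that a nonlinear refinement (e.g. reprojection-error minimisation) would then polish.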

[96] Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images

Sam Bahrami, Dylan Campbell

Main category: cs.CV

TL;DR: A method for predicting complete scene layouts including occluded surfaces using monocular images, with a focus on structural elements like walls, floors, and ceilings.

DetailsMotivation: Current scene reconstruction methods only recover visible surfaces, leading to incomplete reconstructions that miss occluded surfaces. Structural scene elements should be easier to predict due to their planar, repetitive nature.

Method: Created a synthetic dataset (Room Envelopes) with RGB images and two pointmaps per image: one for visible surfaces and one for structural layout surfaces after removing fixtures. This enables direct supervision for monocular geometry estimators.

Result: The approach enables prediction of both visible surfaces and structural layout surfaces, providing understanding of scene extent and object shapes/locations.

Conclusion: The method successfully addresses the limitation of incomplete scene reconstructions by predicting occluded structural elements, offering a more complete understanding of 3D scenes.

Abstract: Modern scene reconstruction methods are able to accurately recover 3D surfaces that are visible in one or more images. However, this leads to incomplete reconstructions, missing all occluded surfaces. While much progress has been made on reconstructing entire objects given partial observations using generative models, the structural elements of a scene, like the walls, floors and ceilings, have received less attention. We argue that these scene elements should be relatively easy to predict, since they are typically planar, repetitive and simple, and so less costly approaches may be suitable. In this work, we present a synthetic dataset – Room Envelopes – that facilitates progress on this task by providing a set of RGB images and two associated pointmaps for each image: one capturing the visible surface and one capturing the first surface once fittings and fixtures are removed, that is, the structural layout. As we show, this enables direct supervision for feed-forward monocular geometry estimators that predict both the first visible surface and the first layout surface. This confers an understanding of the scene’s extent, as well as the shape and location of its objects.

[97] Simple 3D Pose Features Support Human and Machine Social Scene Understanding

Wenshuo Qin, Leyla Isik

Main category: cs.CV

TL;DR: Humans use 3D visuospatial pose information for social interaction recognition, which outperforms current AI vision models. Simple 3D social pose features (face position and direction) match full joint data and improve AI performance.

DetailsMotivation: To understand why humans excel at social interaction recognition while AI struggles, and test if 3D pose information is the key factor missing in current vision models.

Method: Combined pose and depth estimation to extract 3D joint positions from videos, compared with AI vision models, and derived minimal 3D social pose features (face position and direction).

Result: 3D joint positions outperformed most AI models. Minimal 3D social pose features matched full joint data’s predictive power and significantly improved AI model performance when combined.

Conclusion: Human social scene understanding relies on explicit 3D pose representations and can be supported by simple visuospatial primitives, with 3D pose feature representation predicting AI model performance.

Abstract: Humans can quickly and effortlessly extract a variety of information about others’ social interactions from visual input, ranging from visuospatial cues like whether two people are facing each other to higher-level information. Yet, the computations supporting these abilities remain poorly understood, and social interaction recognition continues to challenge even the most advanced AI vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose information to make social interaction judgments, which is absent in most AI vision models. To test this, we combined state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in short video clips depicting everyday human actions and compared their ability to predict human social interaction judgments with current AI vision models. Strikingly, 3D joint positions outperformed most current AI vision models, revealing that key social information is available in explicit body position but not in the learned features of most vision models, including even the layer-wise embeddings of the pose models used to extract joint positions. To uncover the critical pose features humans use to make social judgments, we derived a compact set of 3D social pose features describing only the 3D position and direction of faces in the videos. We found that these minimal descriptors matched the predictive strength of the full set of 3D joints and significantly improved the performance of off-the-shelf AI vision models when combined with their embeddings. Moreover, the degree to which 3D social pose features were represented in each off-the-shelf AI vision model predicted the model’s ability to match human social judgments. Together, our findings provide strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.
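The minimal "face position and direction" descriptor can be sketched from 3D joints as below. The particular joints used (nose and eyes) and the facing-each-other test are illustrative assumptions; the paper only specifies that the features describe the 3D position and direction of faces.

```python
import numpy as np

def face_features(joints):
    """Face position (nose) and a unit facing direction estimated as the
    normal of the plane spanned by the inter-eye axis and the
    eyes-to-nose axis. Sign depends on the eye labelling convention."""
    nose, le, re = joints["nose"], joints["l_eye"], joints["r_eye"]
    across = re - le                      # across the face
    down = nose - (le + re) / 2.0         # eye midpoint to nose
    normal = np.cross(down, across)       # out of the face (by convention)
    return nose, normal / np.linalg.norm(normal)

def facing_each_other(j1, j2, cos_thresh=0.0):
    """Coarse visuospatial cue: both faces oriented toward the other."""
    p1, d1 = face_features(j1)
    p2, d2 = face_features(j2)
    to2 = (p2 - p1) / np.linalg.norm(p2 - p1)
    return bool(d1 @ to2 > cos_thresh and d2 @ (-to2) > cos_thresh)
```

Simple dot-product cues like this are the kind of structured visuospatial primitive the study argues supports social judgments.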

[98] Caption-Driven Explainability: Probing CNNs for Bias via CLIP

Patrick Koller, Amil V. Dravid, Guido M. Schuster, Aggelos K. Katsaggelos

Main category: cs.CV

TL;DR: This paper proposes a caption-based XAI method that integrates standalone ML models into CLIP using network surgery to identify dominant concepts in predictions, improving robustness against covariate shifts.

DetailsMotivation: Traditional saliency map XAI methods can be misleading when spurious and salient features overlap in pixel space, creating robustness problems in ML models.

Method: Integrates standalone models into CLIP using novel network surgery approach to create caption-based XAI that identifies dominant concepts contributing to predictions.

Result: The proposed method minimizes the risk of models falling for covariate shifts and helps develop more robust ML models.

Conclusion: Caption-based XAI using CLIP integration through network surgery provides more reliable explanations than traditional saliency maps, significantly contributing to ML robustness.

Abstract: Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the model's prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at https://github.com/patch0816/caption-driven-xai
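The final concept-ranking step of a caption-based explanation can be sketched as below. This assumes the image and caption embeddings already live in a shared CLIP-style space; the paper's network-surgery step that maps a standalone CNN into that space is not reproduced here.

```python
import numpy as np

def dominant_concept(image_emb, caption_embs, captions):
    """Rank candidate concept captions by cosine similarity to an image
    embedding and return the dominant concept plus the sorted scores."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ img
    order = np.argsort(-sims)
    return captions[order[0]], sims[order]
```

Reading off the top caption names the concept driving the prediction, which is exactly the information a pixel-space saliency map cannot disambiguate when spurious and salient features overlap.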

[99] CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation

Yuwen Tao, Kanglei Zhou, Xin Tan, Yuan Xie

Main category: cs.CV

TL;DR: CaRF is a camera-aware 3D Gaussian segmentation framework that achieves multi-view consistency by incorporating camera geometry into Gaussian-text interactions and aligning per-Gaussian logits across calibrated views during training.

DetailsMotivation: Existing 3D Gaussian segmentation methods struggle with cross-view consistency due to reliance on 2D rendered pseudo supervision and view-specific feature learning, leading to inconsistent 3D region localization.

Method: Proposes Gaussian Field Camera Encoding (GFCE) to incorporate camera geometry into Gaussian-text interactions, and In Training Paired View Supervision (ITPVS) to align per-Gaussian logits across calibrated views during training.

Result: Achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state-of-the-art methods on Ref LERF, LERF OVS, and 3D OVS datasets respectively.

Conclusion: CaRF enables more reliable and view-consistent 3D scene understanding, with potential applications in embodied AI, AR/VR interaction, and autonomous perception.

Abstract: Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret free-form language expressions and localize the corresponding 3D regions in Gaussian fields. While recent advances have introduced cross-modal alignment between language and 3D geometry, existing pipelines still struggle with cross-view consistency due to their reliance on 2D rendered pseudo supervision and view-specific feature learning. In this work, we present Camera-Aware Referring Field (CaRF), a fully differentiable framework that operates directly in the 3D Gaussian space and achieves multi-view consistency. Specifically, CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian-text interactions to explicitly model view-dependent variations and enhance geometric reasoning. Building on this, In-Training Paired View Supervision (ITPVS) is proposed to align per-Gaussian logits across calibrated views during training, effectively mitigating single-view overfitting and exposing inter-view discrepancies for optimization. Extensive experiments on three representative benchmarks demonstrate that CaRF achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state-of-the-art methods on the Ref-LERF, LERF-OVS, and 3D-OVS datasets, respectively. Moreover, this work promotes more reliable and view-consistent 3D scene understanding, with potential benefits for embodied AI, AR/VR interaction, and autonomous perception.
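A paired-view consistency objective in the spirit of ITPVS can be sketched as below. The paper does not spell out its loss in the abstract, so the symmetric KL between per-Gaussian class distributions under two calibrated views is an assumed stand-in.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def paired_view_loss(logits_a, logits_b):
    """Symmetric KL divergence between per-Gaussian class distributions
    rendered from two views; zero when the views already agree, so
    minimising it penalises single-view overfitting."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    kl_ab = np.sum(pa * (np.log(pa) - np.log(pb)), axis=-1)
    kl_ba = np.sum(pb * (np.log(pb) - np.log(pa)), axis=-1)
    return float(np.mean(kl_ab + kl_ba))
```

Adding such a term during training exposes inter-view disagreements as gradient signal rather than leaving them to surface only at render time.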

[100] PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection

Peiyao Wang, Weining Wang, Qi Li

Main category: cs.CV

TL;DR: PhysCorr is a framework for improving physical consistency in text-to-video generation by introducing PhysicsRM reward model and PhyDPO optimization pipeline.

DetailsMotivation: Current text-to-video models produce visually impressive results but often violate physical plausibility, limiting their use in embodied AI, robotics, and simulation applications.

Method: Proposes PhysicsRM (dual-dimensional reward model for intra-object stability and inter-object interactions) and PhyDPO (direct preference optimization with contrastive feedback and physics-aware reweighting).

Result: Significant improvements in physical realism while maintaining visual fidelity and semantic alignment across multiple benchmarks.

Conclusion: PhysCorr advances physically grounded and trustworthy video generation, enabling better deployment in real-world applications.

Abstract: Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility - manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.
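A DPO-style objective with physics-aware reweighting, as PhyDPO is described, can be sketched as below. Reweighting each preference pair by its PhysicsRM reward gap is an assumption about how the reward model enters the loss; the standard DPO margin term follows Rafailov et al.'s formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phydpo_loss(logp_w, logp_l, ref_w, ref_l, reward_gap, beta=0.1):
    """DPO preference loss with per-pair physics weights.

    logp_w/logp_l: policy log-likelihoods of preferred / rejected videos;
    ref_w/ref_l: the frozen reference model's log-likelihoods;
    reward_gap: PhysicsRM score difference per pair (assumed non-negative),
    used to upweight pairs with large physics-quality gaps.
    """
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    weights = np.asarray(reward_gap, dtype=float)
    weights = weights / (weights.mean() + 1e-8)  # normalise pair weights
    return float(np.mean(-weights * np.log(sigmoid(margin))))
```

Pairs where the reward model sees a large physical-plausibility gap contribute more gradient, steering generation toward physically coherent outputs.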

[101] GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization

Mahmoud Soliman, Omar Abdelaziz, Ahmed Radwan, Anand, Mohamed Shehata

Main category: cs.CV

TL;DR: GNN-MoE enhances Parameter-Efficient Fine-Tuning for domain generalization in Vision Transformers using a Mixture-of-Experts framework with Graph Neural Network routing on inter-patch graphs.

DetailsMotivation: Standard fine-tuning of pretrained Vision Transformers for domain generalization is computationally expensive and can impair generalization performance, creating a need for more efficient adaptation methods.

Method: Proposes GNN-MoE framework that uses efficient Kronecker adapters and replaces token-based routing with a novel GNN router (GCN, GAT, SAGE) operating on inter-patch graphs to dynamically assign patches to specialized experts.

Result: Achieves state-of-the-art or competitive performance on domain generalization benchmarks while maintaining high parameter efficiency.

Conclusion: Graph-based contextual routing through GNNs provides an effective approach for robust and lightweight domain generalization in Vision Transformers.

Abstract: Domain generalization (DG) seeks robust Vision Transformer (ViT) performance on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging; standard fine-tuning is costly and can impair generalization. We propose GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT, SAGE) operates on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware GNN routing leverages inter-patch relationships for better adaptation to domain shifts. GNN-MoE achieves state-of-the-art or competitive DG benchmark performance with high parameter efficiency, highlighting the utility of graph-based contextual routing for robust, lightweight DG.
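The context-aware routing step can be sketched as below. A single GCN-style mean aggregation over the inter-patch graph stands in for the GCN/GAT/SAGE routers the paper evaluates; the weight shapes and single-layer depth are illustrative assumptions.

```python
import numpy as np

def gnn_route(patch_feats, adj, w_msg, w_route):
    """Route each patch to an expert using neighbourhood context.

    patch_feats: (P, D) patch embeddings; adj: (P, P) inter-patch adjacency;
    w_msg: (D, D) message weights; w_route: (D, E) expert-logit weights.
    Returns the hard expert assignment and the soft routing distribution.
    """
    deg = adj.sum(axis=1, keepdims=True) + 1e-8
    h = (adj @ patch_feats) / deg          # mean-aggregate neighbour features
    h = np.maximum(h @ w_msg, 0.0)         # linear + ReLU message update
    logits = h @ w_route                   # per-patch expert logits
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

Because each patch's logits are computed from its neighbourhood rather than its token alone, routing decisions reflect surrounding context, which is the property the paper credits for robustness to domain shift.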

[102] MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging

Mahmoud Soliman, Islam Osman, Mohamed S. Shehata, Rasika Rajapakshe

Main category: cs.CV

TL;DR: MedDChest is a foundational Vision Transformer pre-trained from scratch on 1.2M thoracic images, using novel Guided Random Resized Crops augmentation, achieving superior performance over ImageNet-pretrained models for medical imaging tasks.

DetailsMotivation: Overcome the domain gap between natural images and medical imaging by creating a specialized vision model optimized for thoracic imaging rather than relying on fine-tuning models pre-trained on out-of-domain natural images.

Method: Pre-trained MedDChest from scratch on 1.2M curated multimodal thoracic images (Chest X-ray and CT) from 10 public sources, using novel Guided Random Resized Crops data augmentation that biases sampling towards anatomically relevant regions.

Result: MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models across diverse downstream diagnostic tasks, demonstrating the superiority of large-scale in-domain pre-training with domain-specific data augmentation.

Conclusion: MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for thoracic diagnostic tasks, establishing the value of in-domain pre-training combined with domain-specific data augmentation for medical imaging.

Abstract: The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model’s effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.
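The Guided Random Resized Crops idea, sampling crop centres from a relevance map instead of uniformly, can be sketched as below. How the paper derives its anatomical relevance map is not reproduced; any non-negative per-pixel weighting works in this sketch, and the resize step of the augmentation is omitted.

```python
import numpy as np

def guided_crop(image, relevance, crop_hw, rng):
    """Sample a crop whose centre is drawn from `relevance` (treated as an
    unnormalised probability map), biasing crops toward relevant regions."""
    h, w = image.shape[:2]
    ch, cw = crop_hw
    p = relevance.ravel().astype(float)
    p /= p.sum()
    idx = rng.choice(p.size, p=p)                 # biased centre pixel
    cy, cx = divmod(idx, w)
    y0 = int(np.clip(cy - ch // 2, 0, h - ch))    # keep crop inside image
    x0 = int(np.clip(cx - cw // 2, 0, w - cw))
    return image[y0:y0 + ch, x0:x0 + cw]
```

On a chest scan dominated by empty background, uniform cropping frequently yields uninformative patches; drawing the centre from the relevance map makes anatomically relevant crops the common case.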

[103] Near-Lossless 3D Voxel Representation Free from Iso-surface

Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu, Yangguang Li, Wanli Ouyang, Yuanming Hu, Guang Yang, ChoonHwai Yap

Main category: cs.CV

TL;DR: Faithful Contouring is a sparse voxelized representation for 3D meshes that achieves near-lossless fidelity at 2048+ resolutions without requiring field functions or isosurface extraction, outperforming existing methods in accuracy and efficiency.

Motivation: Existing voxelized representations based on iso-surface methods rely on water-tightening or rendering optimization, which compromise geometric fidelity and cannot preserve sharp features and internal structures.

Method: Proposes Faithful Contouring - a sparse voxelized representation that avoids converting meshes to field functions or extracting isosurface during remeshing. Also designs a dual-mode autoencoder for scalable and detail-preserving shape reconstruction.

Result: Achieves distance errors at 10^-5 level for direct representation. For mesh reconstruction, yields 93% reduction in Chamfer Distance and 35% improvement in F-score over strong baselines. Preserves sharpness and internal structures even for complex geometry and topology.

Conclusion: Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction, confirming superior fidelity as a representation for 3D learning tasks while maintaining flexibility for texturing, manipulation, and editing.

Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.

[104] A Hybrid Deep Learning Model for Robust Biometric Authentication from Low-Frame-Rate PPG Signals

Arfina Rahman, Mahesh Banavar

Main category: cs.CV

TL;DR: A lightweight biometric authentication system using PPG signals from fingertip videos achieves 98% accuracy with a hybrid CVT-ConvMixer-LSTM model that processes time-frequency scalograms.

Motivation: PPG signals offer non-invasive biometric authentication with inherent liveness detection, but face challenges from motion artifacts and physiological variability that require robust feature extraction.

Method: PPG signals from low-frame-rate fingertip videos are preprocessed (baseline removal, PCA, filtering) and converted to time-frequency scalograms via CWT. A hybrid CVT-ConvMixer-LSTM model extracts spatial and temporal features for authentication.
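
The scalogram step above can be sketched with a minimal, numpy-only Morlet CWT: each 1-D PPG segment becomes a 2-D time-frequency image for the downstream network. This is a naive sketch of the standard CWT, not the paper's implementation; the toy signal and scale range are assumptions.

```python
import numpy as np

def morlet_cwt(signal, scales, w0=6.0):
    """Naive continuous wavelet transform with a Morlet mother wavelet;
    returns a (len(scales), len(signal)) scalogram magnitude."""
    n = len(signal)
    t = np.arange(n) - n // 2
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Scaled, normalized Morlet wavelet sampled on the signal grid
        wav = np.exp(1j * w0 * t / s - 0.5 * (t / s) ** 2) / np.sqrt(s)
        out[i] = np.abs(np.convolve(signal, np.conj(wav[::-1]), mode="same"))
    return out

# Toy 14 Hz "PPG" segment: a 1.2 Hz cardiac-like oscillation
fs = 14.0
tt = np.arange(0, 10, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * tt)
scalogram = morlet_cwt(ppg, scales=np.arange(1, 16))
print(scalogram.shape)  # (15, 140)
```

Rows of the scalogram with the strongest energy correspond to the dominant cardiac rhythm, which is exactly the transient dynamics the CVT/ConvMixer branches consume.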

Result: The system achieved 98% authentication accuracy on 46 subjects, demonstrating robustness to noise and inter-subject variability.

Conclusion: The proposed framework is efficient, scalable, and suitable for real-world mobile biometric security applications due to its liveness detection capabilities.

Abstract: Photoplethysmography (PPG) signals, which measure changes in blood volume in the skin using light, have recently gained attention in biometric authentication because of their non-invasive acquisition, inherent liveness detection, and suitability for low-cost wearable devices. However, PPG signal quality is challenged by motion artifacts, illumination changes, and inter-subject physiological variability, making robust feature extraction and classification crucial. This study proposes a lightweight and cost-effective biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The raw PPG signals undergo a standard preprocessing pipeline involving baseline drift removal, motion artifact suppression using Principal Component Analysis (PCA), bandpass filtering, Fourier-based resampling, and amplitude normalization. To generate robust representations, each one-dimensional PPG segment is converted into a two-dimensional time-frequency scalogram via the Continuous Wavelet Transform (CWT), effectively capturing transient cardiovascular dynamics. We developed a hybrid deep learning model, termed CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision Transformer (CVT) and ConvMixer branches with temporal features from a Long Short-Term Memory network (LSTM). The experimental results on 46 subjects demonstrate an authentication accuracy of 98%, validating the robustness of the model to noise and variability between subjects. Due to its efficiency, scalability, and inherent liveness detection capability, the proposed system is well-suited for real-world mobile and embedded biometric security applications.

[105] Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment

Zehui Feng, Chenqi Zhang, Mingru Wang, Minuo Wei, Shiwei Cheng, Cuntai Guan, Ting Han

Main category: cs.CV

TL;DR: Bratrix is a multimodal Language-Anchored Vision-Brain alignment framework that improves visual semantics decoding from neural signals (EEG, MEG, fMRI) by projecting visual and brain representations into a shared latent space with language anchors, incorporating uncertainty perception for robustness.

Motivation: Existing approaches align neural activity directly with visual embeddings, but visual-only representations fail to capture latent semantic dimensions, limiting interpretability and deep robustness due to subject variability and entangled visual features.

Method: Decouples visual stimuli into hierarchical visual and linguistic semantic components, projects both visual and brain representations into shared latent space, incorporates uncertainty perception module for weighting, uses learnable language-anchored semantic matrices, and employs two-stage training (single-modality pretraining + multimodal fine-tuning).

Result: Extensive experiments on EEG, MEG, and fMRI benchmarks show improved retrieval, reconstruction, and captioning performance, surpassing state-of-the-art methods by 14.3% on the 200-way EEG retrieval task.

Conclusion: Bratrix achieves effective multimodal Language-Anchored Vision-Brain alignment, demonstrating superior performance in decoding visual semantics from neural signals through its novel framework design and training strategy.

Abstract: Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI remains a fundamental challenge due to subject variability and the entangled nature of visual features. Existing approaches primarily align neural activity directly with visual embeddings, but visual-only representations often fail to capture latent semantic dimensions, limiting interpretability and deep robustness. To address these limitations, we propose Bratrix, the first end-to-end framework to achieve multimodal Language-Anchored Vision-Brain alignment. Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, and projects both visual and brain representations into a shared latent space, enabling the formation of aligned visual-language and brain-language embeddings. To emulate human-like perceptual reliability and handle noisy neural signals, Bratrix incorporates a novel uncertainty perception module that applies uncertainty-aware weighting during alignment. By leveraging learnable language-anchored semantic matrices to enhance cross-modal correlations and employing a two-stage training strategy of single-modality pretraining followed by multimodal fine-tuning, Bratrix-M improves alignment precision. Extensive experiments on EEG, MEG, and fMRI benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and captioning performance compared to state-of-the-art methods, surpassing them by 14.3% on the 200-way EEG retrieval task. Code and model are available.

[106] Adversarial and Score-Based CT Denoising: CycleGAN vs Noise2Score

Abu Hanif Muhammad Syarubany

Main category: cs.CV

TL;DR: Comparative study of CycleGAN-based residual translator and Noise2Score for CT image denoising in unpaired and self-supervised settings, with CycleGAN achieving superior final image quality.

Motivation: To address CT image denoising in scenarios where paired training data is unavailable, exploring training-data-efficient paradigms that don't require clean-noisy image pairs.

Method: Evaluated two approaches: 1) CycleGAN-based residual translator with optimized U-Net backbone (lambda_cycle=30, lambda_iden=2, ngf=ndf=64), 2) Noise2Score score-matching denoiser for pair-free denoising.
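
The reported hyperparameters slot into the standard CycleGAN generator objective, which weights cycle-consistency and identity terms against the adversarial term. A minimal sketch with the stated lambdas; the individual loss values below are invented for illustration.

```python
def cyclegan_generator_loss(adv, cycle, iden, lambda_cycle=30.0, lambda_iden=2.0):
    """Weighted CycleGAN generator objective with the paper's settings
    (lambda_cycle=30, lambda_iden=2); adv/cycle/iden are the adversarial,
    cycle-consistency, and identity loss terms."""
    return adv + lambda_cycle * cycle + lambda_iden * iden

# Toy loss values, just to show the relative weighting
total = cyclegan_generator_loss(adv=0.7, cycle=0.05, iden=0.1)
print(total)  # 0.7 + 30*0.05 + 2*0.1 = 2.4
```

With lambda_cycle an order of magnitude larger than the other weights, the residual translator is pushed hard toward content preservation, which matters for CT anatomy.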

Result: CycleGAN improved the noisy input from 34.66 dB/0.9234 SSIM to 38.913 dB/0.971 SSIM, achieving an estimated score of 1.9441 and an unseen-set score of 1.9343. Noise2Score showed large gains on very noisy inputs despite slightly lower absolute metrics.

Conclusion: CycleGAN provides strongest final image quality, while Noise2Score offers robust pair-free alternative with competitive performance, making both valuable for different denoising scenarios.

Abstract: We study CT image denoising in the unpaired and self-supervised regimes by evaluating two strong, training-data-efficient paradigms: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Under a common evaluation protocol, a configuration sweep identifies a simple standard U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf = 64) as the most reliable setting; we then train it to convergence with a longer schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs, highlighting its utility when clean pairs are unavailable. Overall, CycleGAN offers the strongest final image quality, whereas Noise2Score provides a robust pair-free alternative with competitive performance. Source code is available at https://github.com/hanifsyarubany/CT-Scan-Image-Denoising-using-CycleGAN-and-Noise2Score.

[107] When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation

Nishchal Sapkota, Haoyan Shi, Yejia Zhang, Xianshi Ma, Bofang Zheng, Danny Z. Chen

Main category: cs.CV

TL;DR: UKAST integrates rational-function based Kolmogorov-Arnold Networks into Swin Transformer encoders for medical image segmentation, achieving state-of-the-art performance with improved data efficiency and reduced computational cost.

Motivation: Address limitations of CNN-based methods in modeling long-range dependencies and Transformer-based methods being data-hungry and computationally expensive for medical image segmentation.

Method: Propose UKAST architecture that integrates rational-function based KANs into Swin Transformer encoders, using Group Rational KANs (GR-KANs) from Kolmogorov-Arnold Transformer to improve efficiency over vanilla spline-based KANs.
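
The rational base function behind GR-KANs can be sketched as a safe Padé-style unit P(x)/(1 + |Q(x)|); in a GR-KAN one coefficient pair is shared by a whole group of channels. The coefficients below are illustrative, and this is a conceptual sketch rather than the KAT implementation.

```python
import numpy as np

def rational_unit(x, a, b):
    """Safe rational base function P(x) / (1 + |Q(x)|), the Pade-style
    form used by rational KANs. Coefficients `a`, `b` are illustrative;
    in GR-KANs one (a, b) pair is shared across a group of channels."""
    P = np.polyval(a, x)
    Q = 1.0 + np.abs(np.polyval(b, x))
    return P / Q

# With P(x) = x and Q(x) = 0 the unit reduces to the identity
y = rational_unit(2.0, a=[1.0, 0.0], b=[0.0])
print(y)  # 2.0
```

Unlike spline bases, this form is a handful of polynomial coefficients per group, which is where the FLOP and parameter savings over vanilla KANs come from.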

Result: Achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, surpassing CNN- and Transformer-based baselines with superior accuracy in data-scarce settings and reduced FLOPs.

Conclusion: KAN-enhanced Transformers show strong potential for advancing data-efficient medical image segmentation, effectively addressing the data-hungry limitations of standard Vision Transformers.

Abstract: Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net like architecture that integrates rational-function based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: https://github.com/nsapkota417/UKAST

[108] SpatialLock: Precise Spatial Control in Text-to-Image Synthesis

Biao Liu, Yuanzhi Liang

Main category: cs.CV

TL;DR: SpatialLock is a novel framework that uses perception signals and grounding information to achieve precise object localization in text-to-image generation, addressing the challenge of controlling object spatial layouts.

Motivation: Existing text-to-image synthesis methods fail to fully utilize positional information, leading to inadequate understanding of object spatial layouts and imprecise control over object localization in generated images.

Method: SpatialLock incorporates two components: Position-Engaged Injection (PoI) that directly integrates spatial information through an attention layer, and Position-Guided Learning (PoG) that employs perception-based supervision to refine object localization.

Result: Experiments show SpatialLock achieves state-of-the-art performance for precise object positioning, with IOU scores above 0.9 across multiple datasets, and improves the visual quality of generated images.

Conclusion: SpatialLock successfully enables precise control over object spatial arrangements in text-to-image generation through joint utilization of perception signals and grounding information.

Abstract: Text-to-Image (T2I) synthesis has made significant advancements in recent years, driving applications such as generating datasets automatically. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IOU scores above 0.9 across multiple datasets.

[109] Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration

Yunghee Lee, Byeonghyun Pak, Junwha Hong, Hoseong Kim

Main category: cs.CV

TL;DR: THG is a training-free diffusion sampling acceleration method that treats classifier-free guidance as a multirate ODE system, allowing different integration rates for noise estimation and guidance terms to reduce computation by up to 30% without quality loss.

Motivation: Current diffusion samplers treat all components equally, failing to exploit the different sensitivity to numerical errors between noise estimation and guidance terms, leading to computational inefficiency.

Method: Reformulates CFG as multirate ODE system, uses tortoise equation for noise estimation on fine grid and hare equation for guidance on coarse grid, with adaptive timestep sampling and guidance-scale scheduling.
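
The multirate idea can be sketched on a toy 1-D "diffusion": the noise estimate (tortoise) is evaluated at every fine step, while the guidance difference (hare) is refreshed only on a coarse grid and cached in between. The update rule, step counts, and toy model branches below are assumptions for illustration, not THG's actual solver.

```python
def thg_sample(eps_uncond, eps_cond, x, steps, coarse_every=3, scale=5.0, lr=0.1):
    """Toy tortoise-and-hare CFG loop. eps_uncond/eps_cond stand in for
    the unconditional and conditional model branches; the guidance term
    (their difference) is recomputed only every `coarse_every` steps."""
    guidance = 0.0
    evals = 0
    for t in range(steps):
        e_u = eps_uncond(x, t)           # tortoise: fine-grid noise estimate
        evals += 1
        if t % coarse_every == 0:        # hare: coarse-grid guidance refresh
            guidance = eps_cond(x, t) - e_u
            evals += 1
        x = x - lr * (e_u + scale * guidance)
    return x, evals

x, evals = thg_sample(lambda x, t: 0.1 * x, lambda x, t: 0.1 * x + 0.01,
                      x=1.0, steps=9, coarse_every=3)
print(evals)  # 9 tortoise + 3 hare evaluations = 12, vs 18 for plain CFG
```

Plain CFG would call both branches every step (18 evaluations here); skipping two of every three guidance calls is exactly the redundancy the error-bound analysis licenses.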

Result: Reduces NFE by up to 30% with minimal quality degradation (ΔImageReward ≤ 0.032), outperforms state-of-the-art training-free accelerators under same computation budget.

Conclusion: Multirate formulations offer significant potential for diffusion solver optimization, enabling real-time high-quality image synthesis without model retraining.

Abstract: In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward $\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at https://github.com/yhlee-add/THG.

[110] Text to Sketch Generation with Multi-Styles

Tengjie Li, Shikui Tu, Lei Xu

Main category: cs.CV

TL;DR: A training-free diffusion framework for sketch generation with precise style control using text prompts and reference sketches, featuring style-content guidance to reduce content leakage and support multi-style generation.

Motivation: Existing sketch generation methods lack mechanisms for precise style control and often suffer from content leakage when using reference sketches for style transfer.

Method: Training-free diffusion framework that incorporates reference features as auxiliary information with linear smoothing and style-content guidance, avoiding overwriting self-attention matrices. Also supports multi-style generation via joint AdaIN module.
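
The joint AdaIN idea can be sketched in a few lines: whiten the content features, then re-inject style statistics averaged over several references. This is a global-statistics simplification (real AdaIN acts per channel on feature maps), and the averaging rule is an assumption about how the joint module coordinates styles.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization: whiten the content features,
    then re-colour them with the style statistics."""
    mu, sigma = content.mean(), content.std()
    return style_std * (content - mu) / (sigma + eps) + style_mean

def joint_adain(content, styles):
    """Joint AdaIN over several reference styles: average the style
    statistics before injection (a simplification of the paper's
    multi-style coordination)."""
    mean = np.mean([s.mean() for s in styles])
    std = np.mean([s.std() for s in styles])
    return adain(content, mean, std)

content = np.array([0.0, 1.0, 2.0, 3.0])
styles = [np.array([10.0, 10.0]), np.array([0.0, 2.0])]
out = joint_adain(content, styles)
print(round(out.mean(), 6))  # the output mean matches the averaged style mean
```

The output distribution follows the blended style statistics while the relative ordering of the content values is preserved, which is the property that lets style change without content leakage.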

Result: Achieves high-quality sketch generation with accurate style alignment, reduced content leakage, and improved flexibility in style control, especially effective when reference and target sketches have low structural similarity.

Conclusion: The proposed framework enables precise style control in sketch generation while maintaining content integrity, offering a flexible solution for single and multi-style synthesis without requiring training.

Abstract: Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.

[111] Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)

Venkata Manikanta Desu, Syed Fawaz Ali

Main category: cs.CV

TL;DR: Complete pipeline for automated tennis match analysis using deep learning models for player/ball detection, tracking, and court keypoint detection, providing detailed analytics and annotated videos.

Motivation: To enable automated analysis of tennis matches for coaches, broadcasters, and players by providing real-time detection and tracking of players and ball with spatial reference through court keypoints.

Method: Integrated deep learning framework using YOLOv8 for player detection, custom-trained YOLOv5 for ball tracking, and ResNet50-based architecture for court keypoint detection.
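
One of the derived metrics, ball speed, follows directly from per-frame detections once the court keypoints fix the pixel-to-metre scale. A hedged sketch; the scale factor, frame rate, and positions below are toy values, and the real system would use the known court dimensions to calibrate.

```python
import math

def ball_speed_kmh(p1, p2, metres_per_pixel, fps):
    """Ball speed from detections in two consecutive frames.

    `metres_per_pixel` would come from the detected court keypoints,
    since the real court's dimensions are known; values here are toy."""
    dist_m = math.dist(p1, p2) * metres_per_pixel
    return dist_m * fps * 3.6  # m/frame -> m/s -> km/h

speed = ball_speed_kmh((100, 200), (160, 200), metres_per_pixel=0.01, fps=30)
# 60 px * 0.01 m/px = 0.6 m per frame; at 30 fps that is 18 m/s = 64.8 km/h
```

Player movement patterns and reaction times fall out of the same tracked coordinates with analogous per-frame arithmetic.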

Result: Robust performance in varying court conditions and match scenarios, generating annotated videos with detailed performance metrics including player movement patterns, ball speed, shot accuracy, and reaction times.

Conclusion: The system successfully provides actionable insights into game dynamics through automated analysis, demonstrating practical utility for tennis match analysis applications.

Abstract: This study presents a complete pipeline for automated tennis match analysis. Our framework integrates multiple deep learning models to detect and track players and the tennis ball in real time, while also identifying court keypoints for spatial reference. Using YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection, our system provides detailed analytics including player movement patterns, ball speed, shot accuracy, and player reaction times. The experimental results demonstrate robust performance in varying court conditions and match scenarios. The model outputs an annotated video along with detailed performance metrics, enabling coaches, broadcasters, and players to gain actionable insights into the dynamics of the game.

[112] DMSORT: An efficient parallel maritime multi-object tracking architecture for unmanned vessel platforms

Shengyu Tang, Zeyuan Lu, Jiazhi Dong, Changdong Yu, Xiaoyu Wang, Yaohui Lyu, Weihao Xia

Main category: cs.CV

TL;DR: DMSORT is an efficient dual-branch maritime multi-object tracking method that combines object detection/ReID with camera motion compensation to handle challenging marine environments with camera jitter and visual degradation.

Motivation: Complicated maritime environments cause camera motion and visual degradation, posing significant challenges to multi-object tracking for safe vessel navigation and maritime surveillance.

Method: Parallel tracker with affine compensation using: 1) RCDN for robust object detection, 2) Li-TAE Transformer for appearance features, 3) Camera motion estimation branch to decouple platform/target motion, 4) Clustering-optimized feature fusion for motion and appearance cues.
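
The platform-motion compensation in step 3 amounts to warping each Kalman-predicted position by the estimated camera transform, so that what remains is target-intrinsic motion. A minimal affine sketch; DMSORT estimates a projective transform, and the matrix/vector here are toy values.

```python
import numpy as np

def compensate_platform_motion(pred_xy, A, t):
    """Warp a Kalman-predicted target position by the estimated camera
    (platform) motion, modelled here as a 2x2 affine A plus translation t,
    so the residual motion is target-intrinsic. Toy stand-in for DMSORT's
    projective compensation."""
    return A @ pred_xy + t

# Toy case: the camera jittered by a 90-degree roll plus a shift
A = np.array([[0.0, -1.0], [1.0, 0.0]])
t = np.array([1.0, 0.0])
corrected = compensate_platform_motion(np.array([2.0, 3.0]), A, t)
print(corrected)  # [-2.  2.]
```

Applying this inside the Kalman predict step is what keeps true object trajectories stable under wave-induced camera jitter.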

Result: Achieves state-of-the-art performance on Singapore Maritime Dataset with fastest runtime among ReID-based MOT frameworks, maintaining high identity consistency and robustness to jitter and occlusion.

Conclusion: DMSORT effectively addresses maritime MOT challenges through its dual-branch architecture with camera motion compensation, demonstrating superior performance and efficiency in real-world marine environments.

Abstract: Accurate perception of the marine environment through robust multi-object tracking (MOT) is essential for ensuring safe vessel navigation and effective maritime surveillance. However, the complicated maritime environment often causes camera motion and subsequent visual degradation, posing significant challenges to MOT. To address this challenge, we propose an efficient Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the framework is a parallel tracker with affine compensation, which incorporates an object detection and re-identification (ReID) branch, along with a dedicated branch for dynamic camera motion estimation. Specifically, a Reversible Columnar Detection Network (RCDN) is integrated into the detection module to leverage multi-level visual features for robust object detection. Furthermore, a lightweight Transformer-based appearance extractor (Li-TAE) is designed to capture global contextual information and generate robust appearance features. Another branch decouples platform-induced and target-intrinsic motion by constructing a projective transformation, applying platform-motion compensation within the Kalman filter, and thereby stabilizing true object trajectories. Finally, a clustering-optimized feature fusion module effectively combines motion and appearance cues to ensure identity consistency under noise, occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset demonstrate that DMSORT achieves state-of-the-art performance. Notably, DMSORT attains the fastest runtime among existing ReID-based MOT frameworks while maintaining high identity consistency and robustness to jitter and occlusion. Code is available at: https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.

[113] Learning from Online Videos at Inference Time for Computer-Use Agents

Yujian Liu, Ze Wang, Hao Chen, Ximeng Sun, Xiaodong Yu, Jialian Wu, Jiang Liu, Emad Barsoum, Zicheng Liu, Shiyu Chang

Main category: cs.CV

TL;DR: A framework that enables computer-use agents to learn from online video tutorials at inference time by retrieving, processing, and dynamically selecting video demonstrations as in-context guidance.

Motivation: Computer-use agents lag behind humans in tasks requiring domain-specific procedural knowledge, while humans can effectively learn from video tutorials by selectively imitating relevant segments.

Method: Proposes a framework that retrieves tutorial videos, converts them into structured demonstration trajectories using a VLM to infer UI actions, segments videos into action subsequences with textual objectives, and uses a two-stage selection mechanism to dynamically choose guidance during execution.
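
The two-stage selection can be sketched as: shortlist demonstration trajectories by similarity to the overall task, then pick one by similarity to the agent's current state. The embeddings, field names, and cosine scoring are illustrative assumptions, not the paper's actual selector.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_trajectory(task_emb, state_emb, trajectories, shortlist_k=2):
    """Two-stage selection sketch: stage 1 shortlists by task similarity,
    stage 2 picks by similarity to the current execution state.
    Embeddings and the `objective` field are illustrative."""
    ranked = sorted(trajectories,
                    key=lambda tr: cosine(task_emb, tr["objective"]),
                    reverse=True)
    return max(ranked[:shortlist_k],
               key=lambda tr: cosine(state_emb, tr["objective"]))

trajectories = [
    {"name": "open-settings", "objective": np.array([1.0, 0.1])},
    {"name": "edit-profile",  "objective": np.array([0.9, 0.5])},
    {"name": "export-data",   "objective": np.array([0.0, 1.0])},
]
chosen = select_trajectory(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                           trajectories)
print(chosen["name"])  # edit-profile
```

Re-running stage 2 at every step is what lets the agent switch to the most helpful local guidance as its subgoal changes.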

Result: Experiments on two benchmarks show the framework consistently outperforms base agents and variants using only textual tutorials or transcripts.

Conclusion: Online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time, with trajectory segmentation/selection, action filtering, and visual information being key factors.

Abstract: Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time. Our code is available at https://github.com/UCSB-NLP-Chang/video_demo.

[114] Seeing Straight: Document Orientation Detection for Efficient OCR

Suranjan Goswami, Abhinav Ravi, Raja Kolla, Ali Faraz, Shaharukh Khan, Akash, Chandra Khatri, Shubham Agarwal

Main category: cs.CV

TL;DR: A new benchmark (ORB) for evaluating OCR robustness to image rotations, with a lightweight rotation classification pipeline that achieves 96% accuracy on English and 92% on Indic languages, significantly boosting OCR performance.

Motivation: Accurate rotation correction is essential for enhancing OCR performance in real-world settings where misalignment commonly occurs due to camera orientation errors during document capture.

Method: Introduced OCR-Rotation-Bench (ORB) benchmark with English and 11 Indic language datasets, and developed a fast rotation classification pipeline using Phi-3.5-Vision encoder with dynamic image cropping, fine-tuned for 4-class rotation classification.
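
Once the 4-class head predicts the rotation, correcting the page is a quarter-turn lookup. A sketch of that final step; the logits stand in for the fine-tuned classifier's output, and the class-to-angle convention is an assumption.

```python
import numpy as np

ANGLES = [0, 90, 180, 270]  # the 4 rotation classes

def deskew(image, logits):
    """Undo the predicted rotation. `logits` stands in for the output of
    the fine-tuned classification head; rotating k quarter-turns
    counter-clockwise reverses a clockwise rotation of ANGLES[k]
    (the sign convention here is an assumption)."""
    k = int(np.argmax(logits))
    return np.rot90(image, k=k)

img = np.arange(6).reshape(2, 3)
rotated = np.rot90(img, k=-1)                 # simulate a 90-degree clockwise capture
fixed = deskew(rotated, logits=[0.1, 2.0, 0.3, 0.2])
print(np.array_equal(fixed, img))  # True
```

Running this cheap standalone step before OCR is what yields the reported downstream gains on rotated inputs.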

Result: Achieved 96% accuracy on English dataset and 92% on Indic dataset for rotation classification. Boosted OCR performance by up to 14% for closed-source models and up to 4x for open-weights models in simulated real-world settings.

Conclusion: The proposed rotation classification module is highly effective and plays a critical role in improving OCR performance across both English and multilingual Indic document scenarios.

Abstract: Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in the real world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR) where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for 4-class rotation task in a standalone fashion. Our method achieves near-perfect accuracy of 96% and 92%, respectively, in identifying rotations on the two datasets. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance: closed-source (up to 14%) and open-weights models (up to 4x) in the simulated real-world setting.

[115] Systematic Evaluation of Preprocessing Techniques for Accurate Image Registration in Digital Pathology

Fatemehzahra Darzi, Rodrigo Escobar Diaz Guerrero, Thomas Bocklitz

Main category: cs.CV

TL;DR: CycleGAN color transformation achieves the lowest registration errors when aligning H&E stained images with non-linear multimodal images, outperforming other color transformation methods like Macenko, Reinhard, and Vahadane.

DetailsMotivation: Accurate registration of images from different modalities is essential in digital pathology for enabling direct comparison and integration of information from different stains or imaging modalities, supporting applications such as biomarker analysis and tissue reconstruction.

Method: Used a dataset of 20 tissue sample pairs with various preprocessing steps including different color transformations (CycleGAN, Macenko, Reinhard, Vahadane), inversion, contrast adjustment, intensity normalization, and denoising. Images were registered using VALIS method with rigid followed by non-rigid registration. Performance was evaluated using relative Target Registration Error (rTRE) metrics and custom point-based evaluation.

Result: CycleGAN color transformation achieved the lowest registration errors in both scenarios (original and inverted multimodal images), while other methods showed higher errors. Registration performance was measured using median of median rTRE (MMrTRE) and average of median rTRE (AMrTRE) values.

Conclusion: Applying color transformation before registration improves alignment between images from different modalities and supports more reliable analysis in digital pathology, with CycleGAN being the most effective method among those tested.

Abstract: Image registration refers to the process of spatially aligning two or more images by mapping them into a common coordinate system, so that corresponding anatomical or tissue structures are matched across images. In digital pathology, registration enables direct comparison and integration of information from different stains or imaging modalities, supporting applications such as biomarker analysis and tissue reconstruction. Accurate registration of images from different modalities is an essential step in digital pathology. In this study, we investigated how various color transformation techniques affect image registration between hematoxylin and eosin (H&E) stained images and non-linear multimodal images. We used a dataset of 20 tissue sample pairs, with each pair undergoing several preprocessing steps, including different color transformations (CycleGAN, Macenko, Reinhard, Vahadane), inversion, contrast adjustment, intensity normalization, and denoising. All images were registered using the VALIS registration method, which first applies rigid registration and then performs non-rigid registration in two steps on both low and high-resolution images. Registration performance was evaluated using the relative Target Registration Error (rTRE). We reported the median of median rTRE values (MMrTRE) and the average of median rTRE values (AMrTRE) for each method. In addition, we performed a custom point-based evaluation using ten manually selected key points. Registration was done separately for two scenarios, using either the original or inverted multimodal images. In both scenarios, CycleGAN color transformation achieved the lowest registration errors, while the other methods showed higher errors. These findings show that applying color transformation before registration improves alignment between images from different modalities and supports more reliable analysis in digital pathology.
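The rTRE aggregation described above can be sketched directly: per image pair, each landmark's registration error is normalised by the image diagonal, and the study reports the median (MMrTRE) and mean (AMrTRE) of the per-pair median rTRE values. The landmark coordinates below are hypothetical.

```python
import numpy as np

def rtre(moved_pts, target_pts, image_shape):
    """Relative TRE: landmark distances divided by the image diagonal."""
    diag = np.sqrt(sum(s ** 2 for s in image_shape))
    errs = np.linalg.norm(np.asarray(moved_pts, float)
                          - np.asarray(target_pts, float), axis=1)
    return errs / diag

def aggregate(per_pair_rtres):
    """MMrTRE = median of per-pair medians, AMrTRE = mean of per-pair medians."""
    medians = np.array([np.median(r) for r in per_pair_rtres])
    return {"MMrTRE": float(np.median(medians)),
            "AMrTRE": float(np.mean(medians))}

# Two toy image pairs with three landmarks each, on a 300 x 400 image
pair1 = rtre([[0, 0], [3, 4], [6, 8]], [[0, 0], [0, 0], [0, 0]], (300, 400))
pair2 = rtre([[1, 0], [0, 2], [0, 0]], [[0, 0], [0, 0], [0, 0]], (300, 400))
print(aggregate([pair1, pair2]))
```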

[116] Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification

Josef Mayr, Anna Reithmeir, Maxime Di Folco, Julia A. Schnabel

Main category: cs.CV

TL;DR: Covariance descriptors from pre-trained vision encoders outperform handcrafted features for medical image classification, with SPDNet showing superior performance when combined with DINOv2 features.

DetailsMotivation: Covariance descriptors have shown strong performance in general computer vision but remain underexplored in medical imaging, prompting investigation of their effectiveness for medical image classification.

Method: Construct covariance descriptors from features extracted by pre-trained general vision encoders (DINOv2 and MedSAM) and compare them with handcrafted descriptors using SPDNet on eleven datasets from the MedMNIST benchmark.

Result: Covariance descriptors from GVE features consistently outperform handcrafted features, and SPDNet yields superior performance to state-of-the-art methods when combined with DINOv2 features.

Conclusion: Combining covariance descriptors with powerful pretrained vision encoders shows significant potential for medical image analysis.

Abstract: Covariance descriptors capture second-order statistics of image features. They have shown strong performance in general computer vision tasks, but remain underexplored in medical imaging. We investigate their effectiveness for both conventional and learning-based medical image classification, with a particular focus on SPDNet, a classification network specifically designed for symmetric positive definite (SPD) matrices. We propose constructing covariance descriptors from features extracted by pre-trained general vision encoders (GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and MedSAM - are evaluated across eleven binary and multi-class datasets from the MedMNIST benchmark. Our results show that covariance descriptors derived from GVE features consistently outperform those derived from handcrafted features. Moreover, SPDNet yields superior performance to state-of-the-art methods when combined with DINOv2 features. Our findings highlight the potential of combining covariance descriptors with powerful pretrained vision encoders for medical image analysis.
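A covariance descriptor of the kind described above is easy to sketch: patch features from an encoder (random stand-ins here, not actual DINOv2/MedSAM outputs) are summarised by their D×D covariance, regularised to stay SPD, and mapped to the log-Euclidean tangent space where standard classifiers or SPDNet layers can operate. The regulariser `eps` is an assumption.

```python
import numpy as np

def covariance_descriptor(features: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """features: (N, D) patch features -> (D, D) SPD covariance matrix."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(features) - 1)
    return cov + eps * np.eye(cov.shape[0])  # enforce strict positive definiteness

def log_euclidean(spd: np.ndarray) -> np.ndarray:
    """Matrix logarithm via eigendecomposition (valid for SPD matrices)."""
    vals, vecs = np.linalg.eigh(spd)
    return (vecs * np.log(vals)) @ vecs.T    # V diag(log lambda) V^T

rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 8))  # e.g. 14 x 14 patches, 8-dim features
desc = log_euclidean(covariance_descriptor(feats))
assert np.allclose(desc, desc.T)       # descriptor stays symmetric
```

After the log map, the descriptor lives in a flat space, so it can be vectorised and fed to any Euclidean classifier.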

[117] AStF: Motion Style Transfer via Adaptive Statistics Fusor

Hanmo Chen, Chenghao Xu, Jiexi Yan, Cheng Deng

Main category: cs.CV

TL;DR: Proposes AStF with skewness and kurtosis statistics for better motion style transfer, outperforming state-of-the-art methods.

DetailsMotivation: Traditional image style transfer methods using mean and variance are insufficient for motion data due to complex dynamic patterns and spatiotemporal coherence.

Method: Adaptive Statistics Fusor (AStF) with Style Disentanglement Module and High-Order Multi-Statistics Attention, trained with Motion Consistency Regularization discriminator.

Result: By modeling spatiotemporal statistical patterns, AStF outperforms state-of-the-art methods in motion style transfer.

Conclusion: Incorporating skewness and kurtosis provides more comprehensive motion style modeling, achieving better transfer quality.

Abstract: Human motion style transfer allows characters to appear less rigid and more realistic with a specific style. Traditional arbitrary image style transfer typically processes mean and variance, which has proved effective. Similar methods have been adapted for motion style transfer. However, due to the fundamental differences between images and motion, relying on mean and variance is insufficient to fully capture the complex dynamic patterns and spatiotemporal coherence properties of motion data. Building upon this, our key insight is to bring two more coefficients, skewness and kurtosis, into the analysis of motion style. Specifically, we propose a novel Adaptive Statistics Fusor (AStF) which consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in conjunction with a Motion Consistency Regularization (MCR) discriminator. Experimental results show that, by providing a more comprehensive model of the spatiotemporal statistical patterns inherent in dynamic styles, our proposed AStF outperforms state-of-the-art methods in motion style transfer. Our code and model are available at https://github.com/CHMimilanlan/AStF.
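The paper's key insight is to describe style with four per-channel statistics rather than the two used in AdaIN-style transfer. A minimal sketch of the statistics themselves (the AStF modules SDM and HOS-Attn are not reproduced here):

```python
import numpy as np

def style_statistics(x: np.ndarray) -> dict:
    """x: (T, C) motion features over time -> per-channel style statistics."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    z = (x - mu) / (np.sqrt(var) + 1e-8)   # standardised features
    return {
        "mean": mu,
        "variance": var,
        "skewness": (z ** 3).mean(axis=0),  # asymmetry of the distribution
        "kurtosis": (z ** 4).mean(axis=0),  # tail heaviness
    }

# Hypothetical motion clip: 120 frames, 4 feature channels
motion = np.random.default_rng(1).standard_normal((120, 4))
stats = style_statistics(motion)
assert all(v.shape == (4,) for v in stats.values())
```

Skewness and kurtosis distinguish dynamic styles that share the same mean and variance, which is the gap the abstract identifies in mean/variance-only transfer.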

[118] MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection

Marawan Elbatel, Anbang Wang, Keyuan Liu, Kaouther Mouheb, Enrique Almar-Munoz, Lizhuo Lin, Yanqi Yang, Karim Lekadir, Xiaomeng Li

Main category: cs.CV

TL;DR: MedSapiens adapts human-centric foundation models for anatomical landmark detection in medical imaging, achieving state-of-the-art performance across multiple datasets with significant improvements over both generalist and specialist models.

DetailsMotivation: To leverage the untapped potential of human-centric foundation models, which are inherently optimized for spatial pose localization, for anatomical landmark detection in medical imaging instead of relying solely on domain-specific models.

Method: Adapts Sapiens (a human-centric foundation model for pose estimation) to medical imaging through multi-dataset pretraining, creating MedSapiens model.

Result: Achieves up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in average success detection rate (SDR). In few-shot settings, achieves 2.69% improvement over state-of-the-art.

Conclusion: Human-centric foundation models provide strong priors for anatomical landmark detection and their adaptation to medical imaging establishes new state-of-the-art performance across multiple datasets.

Abstract: This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess MedSapiens' adaptability to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens.
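The success detection rate (SDR) that MedSapiens is benchmarked on is simply the fraction of predicted landmarks that fall within a radius of the ground truth. The radius and landmark coordinates below are illustrative, not from the paper.

```python
import numpy as np

def sdr(pred: np.ndarray, gt: np.ndarray, radius_mm: float) -> float:
    """Fraction of landmarks whose prediction lies within radius_mm of truth."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float((dists <= radius_mm).mean())

gt = np.array([[10.0, 10.0], [50.0, 20.0], [30.0, 40.0], [5.0, 5.0]])
pred = gt + np.array([[1.0, 0.0], [0.0, 3.0], [6.0, 8.0], [0.5, 0.5]])
print(sdr(pred, gt, radius_mm=2.0))  # two of the four landmarks are within 2 mm
```

In practice SDR is reported at several radii (e.g. 2, 4, 10 mm), and the paper's improvements refer to the average across them.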

[119] Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

Main category: cs.CV

TL;DR: Proto-LeakNet is an interpretable attribution framework that detects signal leaks in diffusion model outputs for AI-image and deepfake forensics, achieving 98.13% Macro AUC without retraining for unseen generators.

DetailsMotivation: The growing sophistication of synthetic image and deepfake generation models has made source attribution and authenticity verification a critical challenge, as diffusion pipelines unintentionally imprint persistent statistical traces (signal leaks) in their outputs.

Method: The framework operates in the latent domain of diffusion models, re-simulating partial forward diffusion to expose generator-specific cues. It uses a temporal attention encoder to aggregate multi-step latent features and a feature-weighted prototype head to structure the embedding space for transparent attribution.

Result: Proto-LeakNet achieves a Macro AUC of 98.13%, learns a latent geometry that remains robust under post-processing, surpasses state-of-the-art methods, and achieves strong separability between known and unseen generators.

Conclusion: Modeling signal-leak bias in latent space enables reliable and interpretable AI-image and deepfake forensics, demonstrating that signal leaks can be effectively leveraged for source attribution.

Abstract: The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Operating in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability between known and unseen generators. These results demonstrate that modeling signal-leak bias in latent space enables reliable and interpretable AI-image and deepfake forensics. The code for the whole work will be available upon submission.
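The "partial forward diffusion" that Proto-LeakNet re-simulates is the standard DDPM noising step q(x_t | x_0): x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise. A sketch under the common linear-beta schedule (the schedule and step count are assumptions, not taken from the paper):

```python
import numpy as np

def forward_diffuse(x0: np.ndarray, t: int, T: int = 1000, seed: int = 0):
    """Partially noise x0 to step t with the standard DDPM forward process."""
    betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
    alpha_bar = np.cumprod(1.0 - betas)[t]      # cumulative signal retention
    noise = np.random.default_rng(seed).standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

latent = np.ones((4, 8, 8))                     # stand-in for a diffusion latent
partially_noised = forward_diffuse(latent, t=250)
assert partially_noised.shape == latent.shape
```

Proto-LeakNet's contribution is what happens after this step: aggregating multi-step latent features with a temporal attention encoder to expose generator-specific residues.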

[120] DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

Yujie Yang, Shuang Li, Jun Ye, Neng Dong, Fan Li, Huafeng Li

Main category: cs.CV

TL;DR: Proposes DinoGRL framework using DINOv2 to learn gait features for video-based visible-infrared person re-identification, achieving state-of-the-art performance.

DetailsMotivation: Existing VVI-ReID methods focus on modality-invariant visual features but overlook gait features, which are modality-invariant and rich in temporal dynamics, limiting spatiotemporal consistency modeling for cross-modal video matching.

Method: DINOv2-Driven Gait Representation Learning (DinoGRL) with Semantic-Aware Silhouette and Gait Learning (SASGL) to generate silhouette representations using DINOv2 priors, and Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module for bidirectional interactions between gait and appearance streams.

Result: Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate superiority, significantly outperforming existing state-of-the-art methods.

Conclusion: The proposed framework effectively leverages gait features complementary to appearance cues, achieving robust sequence-level representations for cross-modal retrieval in VVI-ReID.

Abstract: Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.

[121] FastGS: Training 3D Gaussian Splatting in 100 Seconds

Shiwei Ren, Tianci Wen, Yongchun Fang, Biao Lu

Main category: cs.CV

TL;DR: FastGS is a novel acceleration framework for 3D Gaussian splatting that uses multi-view consistency to regulate Gaussian density, achieving 3.32×-15.45× training speedup while maintaining rendering quality.

DetailsMotivation: Current 3DGS acceleration methods fail to properly regulate Gaussian numbers during training, causing redundant computational overhead and inefficient training.

Method: Proposes a densification and pruning strategy based on multi-view consistency, eliminating the need for budgeting mechanisms to efficiently manage Gaussian importance.

Result: Achieves 3.32× training acceleration on Mip-NeRF 360 and 15.45× acceleration on Deep Blending compared to vanilla 3DGS, with comparable rendering quality to state-of-the-art methods.

Conclusion: FastGS demonstrates strong generality across various tasks (dynamic scenes, surface reconstruction, SLAM) with 2-7× acceleration, providing an efficient solution to the training time vs. rendering quality trade-off.

Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7× training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at https://fastgs.github.io/
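The budget-free idea above can be caricatured in a few lines: score each Gaussian by how many training views it contributes to, and prune the inconsistent ones. The contribution matrix and both thresholds are hypothetical stand-ins, not FastGS's actual scoring.

```python
import numpy as np

def prune_by_consistency(contrib: np.ndarray, min_views: int, tau: float):
    """contrib: (num_gaussians, num_views) per-view contribution scores.

    Keep a Gaussian only if it contributes above tau in at least min_views
    views; no global Gaussian budget is imposed.
    """
    consistent_views = (contrib > tau).sum(axis=1)
    return consistent_views >= min_views

rng = np.random.default_rng(2)
contrib = rng.random((1000, 12))  # 1000 Gaussians seen from 12 training views
keep = prune_by_consistency(contrib, min_views=6, tau=0.5)
print(f"kept {keep.sum()} / {len(keep)} Gaussians")
```

The point of the sketch is the control flow: the number of survivors is determined by per-Gaussian evidence, not by a preset budget.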

[122] Vision Foundation Models in Agriculture: Toward Domain-Specific Adaptation for Weed Herbicide Trials Assessment

Leire Benito-Del-Valle, Artzai Picón, Daniel Mugica, Manuel Ramos, Eva Portillo, Javier Romero, Carlos Javier Jimenez, Ramón Navarra-Mestre

Main category: cs.CV

TL;DR: Domain-specific vision foundation model adapted for herbicide trials outperforms general-purpose models in species identification and damage classification, especially under unseen conditions and domain shifts, while reducing annotation requirements.

DetailsMotivation: General-purpose vision foundation models have limited performance in agriculture due to fine-grained distinctions needed between plant species and herbicide damage types in diverse environments.

Method: Adapt a general-purpose vision foundation model using self-supervised learning on a large curated agricultural dataset to learn transferable representations optimized for herbicide trial images.

Result: Significant improvements in species identification (F1 from 0.91 to 0.94) and damage classification (F1 from 0.26 to 0.33), with even greater gains under unseen conditions and domain shifts. Achieves 5.4% higher F1 with 80% fewer labeled samples.

Conclusion: Domain-specific foundation models demonstrate strong generalization capabilities and can significantly reduce manual annotation efforts, offering scalable automated solutions for herbicide trial analysis.

Abstract: Herbicide field trials require accurate identification of plant species and assessment of herbicide-induced damage across diverse environments. While general-purpose vision foundation models have shown promising results in complex visual domains, their performance can be limited in agriculture, where fine-grained distinctions between species and damage types are critical. In this work, we adapt a general-purpose vision foundation model to herbicide trial characterization. Trained using a self-supervised learning approach on a large, curated agricultural dataset, the model learns rich and transferable representations optimized for herbicide trial images. Our domain-specific model significantly outperforms the best general-purpose foundation model in both species identification (F1 score improvement from 0.91 to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions (new locations and times), it achieves even greater gains (species identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In domain-shift scenarios, such as drone imagery, it maintains strong performance (species classification from 0.49 to 0.60). Additionally, we show that domain-specific pretraining enhances segmentation accuracy, particularly in low-annotation regimes. An annotation-efficiency analysis reveals that, under unseen conditions, the domain-specific model achieves a 5.4% higher F1 score than the general-purpose model, while using 80% fewer labeled samples. These results demonstrate the generalization capabilities of domain-specific foundation models and their potential to significantly reduce manual annotation efforts, offering a scalable and automated solution for herbicide trial analysis.

[123] RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation

Xiangjun Zhang, Litong Gong, Yinglin Zheng, Yansong Liu, Wentao Jiang, Mingyi Xu, Biao Wang, Tiezheng Ge, Ming Zeng

Main category: cs.CV

TL;DR: RISE-T2V integrates prompt rephrasing and semantic feature extraction into a single step using a Rephrasing Adapter, enabling text-to-video models to generate higher quality videos from concise prompts by better understanding user intent.

DetailsMotivation: Current text-to-video diffusion models fail to maintain video quality with concise prompts due to limited textual semantics understanding, and cannot rephrase prompts online to better align with user intentions.

Method: Introduces RISE-T2V with a Rephrasing Adapter that uses text hidden states during LLM’s next token prediction as video generation condition, implicitly rephrasing basic prompts into comprehensive representations.

Result: RISE-T2V significantly enhances T2V model capabilities and is applicable to various video diffusion architectures, generating high-quality videos that better align with user intent.

Conclusion: RISE-T2V is a versatile framework that improves text-to-video generation by integrating prompt rephrasing and semantic understanding in a unified approach.

Abstract: Most text-to-video (T2V) diffusion models depend on pre-trained text encoders for semantic alignment, yet they often fail to maintain video quality when provided with concise prompts rather than well-designed ones. The primary issue lies in their limited textual semantics understanding. Moreover, these text encoders cannot rephrase prompts online to better align with user intentions, which limits both the scalability and usability of the models. To address these challenges, we introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single and seamless step instead of two separate steps. RISE-T2V is universal and can be applied to various pre-trained LLMs and video diffusion models (VDMs), significantly enhancing their capabilities for T2V tasks. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states during the next token prediction of the LLM as a condition for video generation. By employing a Rephrasing Adapter, the video generation model can implicitly rephrase basic prompts into more comprehensive representations that better match the user's intent. Furthermore, we leverage the powerful capabilities of LLMs to enable video generation models to accomplish a broader range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a versatile framework applicable to different video diffusion model architectures, significantly enhancing the ability of T2V models to generate high-quality videos that align with user intent. Visual results are available on the webpage at https://rise-t2v.github.io.

[124] Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography

Saúl Alonso-Monsalve, Leigh H. Whitehead, Adam Aurisano, Lorena Escudero Sanchez

Main category: cs.CV

TL;DR: Proposes a two-stage method using voxel sparsification and submanifold sparse convolutional networks for automated 3D tumor segmentation in CT scans, achieving state-of-the-art accuracy with significantly reduced computational resources.

DetailsMotivation: Manual tumor delineation in radiological images is specialized, time-consuming, and a bottleneck for clinical quantitative analysis. Current 3D segmentation methods face impracticality due to large voxel counts, requiring downsampling or patches.

Method: Two-stage approach combining voxel sparsification and submanifold sparse convolutional networks, enabling high-resolution 3D segmentation without traditional computational constraints.

Result: Achieved competitive results on KiTS23 renal cancer dataset: 95.8% Dice for kidneys+masses, 85.7% for tumors+cysts, 80.3% for tumors alone. Computational improvements: 60% reduction in inference time and 75% reduction in VRAM usage compared to dense architectures.

Conclusion: The method enables high-resolution 3D tumor segmentation with state-of-the-art accuracy while significantly reducing computational requirements, making automated segmentation more practical for clinical applications.

Abstract: The accurate delineation of tumours in radiological images like Computed Tomography is a very specialised and time-consuming task, and currently a bottleneck preventing quantitative analyses to be performed routinely in the clinical setting. For this reason, developing methods for the automated segmentation of tumours in medical imaging is of the utmost importance and has driven significant efforts in recent years. However, challenges regarding the impracticality of 3D scans, given the large number of voxels to be analysed, usually require the downsampling of such images or using patches thereof when applying traditional convolutional neural networks. To overcome this problem, in this paper we propose a new methodology that uses, divided into two stages, voxel sparsification and submanifold sparse convolutional networks. This method allows segmentations to be performed with high-resolution inputs and a native 3D model architecture, obtaining state-of-the-art accuracies while significantly reducing the computational resources needed in terms of GPU memory and time. We studied the deployment of this methodology in the context of Computed Tomography images of renal cancer patients from the KiTS23 challenge, and our method achieved results competitive with the challenge winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone. Crucially, our method also offers significant computational improvements, achieving up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage compared to an equivalent dense architecture, across both CPU and various GPU cards tested.
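Stage one of the pipeline, voxel sparsification, can be sketched simply: threshold the CT volume (the threshold value here is an assumption) and keep only the "active" voxels as a coordinate list plus features, which is the input format submanifold sparse convolution libraries expect.

```python
import numpy as np

def sparsify(volume: np.ndarray, threshold: float):
    """CT volume -> (coords, feats) for a sparse convolutional network."""
    mask = volume > threshold
    coords = np.argwhere(mask)        # (M, 3) indices of active voxels
    feats = volume[mask][:, None]     # (M, 1) intensity features
    return coords, feats

ct = np.zeros((64, 64, 64))           # toy CT volume
ct[20:24, 30:34, 40:44] = 1.0         # a small high-intensity "organ" region
coords, feats = sparsify(ct, threshold=0.5)
print(coords.shape, feats.shape)      # only the 64 active voxels survive
```

Because only M active voxels (here 64 of 262,144) reach the network, full-resolution 3D inputs become tractable, which is the source of the memory and time savings reported above.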

[125] Comparative Study of CNN Architectures for Binary Classification of Horses and Motorcycles in the VOC 2008 Dataset

Muhammad Annas Shaikh, Hamza Zaman, Arbaz Asif

Main category: cs.CV

TL;DR: Evaluation of 9 CNN architectures for horse/motorcycle classification on VOC 2008 dataset, addressing class imbalance with data augmentation. ConvNeXt-Tiny achieved best performance with 95.53% AP for horses and 89.12% for motorcycles.

DetailsMotivation: To address class imbalance problems in binary classification and evaluate modern CNN architectures for object detection tasks.

Method: Implemented minority-class augmentation techniques and compared 9 architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer using multiple performance metrics.

Result: Substantial performance variations observed; ConvNeXt-Tiny achieved highest AP (95.53% for horses, 89.12% for motorcycles); data augmentation significantly improved minority class detection, especially for deeper architectures.

Conclusion: Provides insights for architecture selection in imbalanced binary classification and quantifies data augmentation’s impact in mitigating class imbalance issues in object detection.

Abstract: This paper presents a comprehensive evaluation of nine convolutional neural network architectures for binary classification of horses and motorcycles in the VOC 2008 dataset. We address the significant class imbalance problem by implementing minority-class augmentation techniques. Our experiments compare modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer across multiple performance metrics. Results demonstrate substantial performance variations, with ConvNeXt-Tiny achieving the highest Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle detection. We observe that data augmentation significantly improves minority class detection, particularly benefiting deeper architectures. This study provides insights into architecture selection for imbalanced binary classification tasks and quantifies the impact of data augmentation strategies in mitigating class imbalance issues in object detection.
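Minority-class augmentation of the kind described above can be sketched as oversampling the minority class with simple flips until the two class counts match. The flip-based transforms are illustrative assumptions, not the paper's exact augmentation recipe.

```python
import numpy as np

def balance_with_flips(images, labels, minority_label, rng):
    """Oversample the minority class with random flips until classes balance."""
    imgs, lbls = list(images), list(labels)
    minority = [im for im, lb in zip(imgs, lbls) if lb == minority_label]
    majority_count = len(lbls) - len(minority)
    for _ in range(majority_count - len(minority)):
        src = minority[rng.integers(len(minority))]
        imgs.append(np.flip(src, axis=int(rng.integers(2))).copy())  # H or V flip
        lbls.append(minority_label)
    return imgs, lbls

rng = np.random.default_rng(4)
images = [np.zeros((8, 8))] * 9 + [np.ones((8, 8))] * 3  # 9 majority, 3 minority
labels = [0] * 9 + [1] * 3
images, labels = balance_with_flips(images, labels, minority_label=1, rng=rng)
assert labels.count(0) == labels.count(1) == 9
```

Real pipelines would add rotations, crops, and color jitter, but the balancing logic is the same.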

[126] Evaluating the Impact of Weather-Induced Sensor Occlusion on BEVFusion for 3D Object Detection

Sanjay Kumar, Tim Brophy, Eoin Martino Grua, Ganesh Sistu, Valentina Donzella, Ciaran Eising

Main category: cs.CV

TL;DR: Study analyzes impact of sensor occlusions on 3D object detection using BEVFusion architecture, showing LiDAR is more critical than cameras for maintaining detection accuracy under adverse conditions.

DetailsMotivation: To investigate how sensor occlusions (from fog, haze, obstructions) affect 3D object detection performance in automated vehicles, particularly for BEV-based fusion methods where occlusion impacts remain underexplored.

Method: Used BEVFusion architecture on nuScenes dataset, evaluating camera and LiDAR sensors separately and in fusion under simulated occlusion conditions, measuring performance with mAP and NDS metrics.

Result: Camera-only detection drops 41.3% under moderate occlusion; LiDAR drops 47.3% only under heavy occlusion; fused detection shows minor 4.1% drop with camera occlusion but significant 26.8% drop with LiDAR occlusion, revealing stronger reliance on LiDAR.

Conclusion: BEVFusion relies more heavily on LiDAR than cameras for 3D detection, highlighting need for occlusion-aware evaluation and improved fusion techniques to handle sensor degradation in adverse conditions.

Abstract: Accurate 3D object detection is essential for automated vehicles to navigate safely in complex real-world environments. Bird’s Eye View (BEV) representations, which project multi-sensor data into a top-down spatial format, have emerged as a powerful approach for robust perception. Although BEV-based fusion architectures have demonstrated strong performance through multimodal integration, the effects of sensor occlusions, caused by environmental conditions such as fog, haze, or physical obstructions, on 3D detection accuracy remain underexplored. In this work, we investigate the impact of occlusions on both camera and Light Detection and Ranging (LiDAR) outputs using the BEVFusion architecture, evaluated on the nuScenes dataset. Detection performance is measured using mean Average Precision (mAP) and the nuScenes Detection Score (NDS). Our results show that moderate camera occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is based only on the camera. On the other hand, LiDAR sharply drops in performance only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%), with a severe impact on long-range detection. In fused settings, the effect depends on which sensor is occluded: occluding the camera leads to a minor 4.1% drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8% drop (to 50.1%), revealing the model’s stronger reliance on LiDAR for the task of 3D object detection. Our results highlight the need for future research into occlusion-aware evaluation methods and improved sensor fusion techniques that can maintain detection accuracy in the presence of partial sensor failure or degradation due to adverse environmental conditions.

[127] A MATLAB tutorial on deep feature extraction combined with chemometrics for analytical applications

Puneet Mishra, Martijntje Vollebregt, Yizhou Ma, Maria Font-i-Furnols

Main category: cs.CV

TL;DR: This tutorial provides step-by-step guidance for using existing deep learning models to extract spatial features from imaging data in analytical chemistry, with MATLAB code demonstrations for various imaging modalities.

DetailsMotivation: Deep learning can enhance spatial information extraction from imaging data in analytical chemistry, but adoption is limited due to lack of structured guidance for implementing existing models.

Method: Provides step-by-step tutorial with MATLAB code demonstrations for using pre-trained deep learning models to extract deep features from various imaging modalities, focusing on feature extraction rather than model training.

Result: A practical guide that enables analytical chemists to apply deep learning approaches to extract spatial information from imaging data and integrate it with other data sources like spectral information.

Conclusion: This tutorial bridges the gap in deep learning adoption for analytical chemistry by providing accessible, structured guidance for extracting spatial features from imaging data using existing open-source models.

Abstract: Background In analytical chemistry, spatial information about materials is commonly captured through imaging techniques, such as traditional color cameras or advanced hyperspectral cameras and microscopes. However, efficiently extracting and analyzing this spatial information for exploratory and predictive purposes remains a challenge, especially when using traditional chemometric methods. Recent advances in deep learning and artificial intelligence have significantly enhanced image processing capabilities, enabling the extraction of multiscale deep features that are otherwise challenging to capture with conventional image processing techniques. Despite the wide availability of open-source deep learning models, adoption in analytical chemistry remains limited because of the absence of structured, step-by-step guidance for implementing these models. Results This tutorial aims to bridge this gap by providing a step-by-step guide for applying deep learning approaches to extract spatial information from imaging data and integrating it with other data sources, such as spectral information. Importantly, the focus of this work is not on training deep learning models for image processing but on using existing open source models to extract deep features from imaging data. Significance The tutorial provides MATLAB code demonstrations, showcasing the processing of imaging data from various imaging modalities commonly encountered in analytical chemistry. Readers must run the tutorial steps on their own datasets using the code presented in this tutorial.
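
The tutorial itself is in MATLAB; as a rough illustration of the downstream chemometrics step it describes (not the authors' code), the sketch below runs PCA on a hypothetical matrix of extracted deep features, a typical exploratory analysis once features have been pulled from a pretrained network. NumPy is assumed.

```python
import numpy as np

def pca_scores(features: np.ndarray, n_components: int = 2):
    """Project a (samples x features) deep-feature matrix onto its top
    principal components -- a common exploratory chemometrics step."""
    X = features - features.mean(axis=0)           # mean-center each feature
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ Vt[:n_components].T               # sample coordinates in PC space
    explained = (S**2 / np.sum(S**2))[:n_components]
    return scores, explained

# Hypothetical example: 10 samples, 512-dim deep features from a pretrained CNN
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 512))
scores, var = pca_scores(feats, 2)
print(scores.shape)  # (10, 2)
```

In practice the feature matrix would come from the pretrained model's activations, and the PC scores could then be fused with spectral data for exploratory or predictive modeling.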

[128] Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA

Itbaan Safwan, Muhammad Annas Shaikh, Muhammad Haaris, Ramail Khan, Muhammad Atif Tahir

Main category: cs.CV

TL;DR: Multi-task framework using LoRA-tuned Florence-2 model for medical VQA, explanation generation, and visual grounding, achieving improved accuracy and interpretability over single-task baselines.

DetailsMotivation: To develop a comprehensive medical VQA system that produces both accurate answers and interpretable explanations with visual grounding for better medical reasoning.

Method: Uses LoRA-tuned Florence-2 model with three curated datasets: Kvasir-VQA-x1 for QA, synthetic explanation dataset for medical reasoning, and text-to-region pairs for visual grounding in a multi-task learning setup.

Result: Substantial improvements over single-task baselines in both answer accuracy and visual localization, demonstrating effectiveness of grounded multi-task learning.

Conclusion: Multi-task learning with visual grounding enables more accurate and interpretable medical VQA systems, highlighting the value of joint learning across visual reasoning tasks.

Abstract: We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
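
The system LoRA-tunes Florence-2. As an illustration of the LoRA idea itself (not of Florence-2 specifics), a minimal sketch of a low-rank adapted linear layer: the frozen weight W is augmented with a trainable update (alpha/r)·B·A, where A and B are small low-rank factors.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Forward pass of a LoRA-adapted linear layer.

    W: frozen (out x in) pretrained weight.
    A: (r x in), B: (out x r) trainable low-rank factors, r << min(out, in).
    Output: W x + (alpha / r) * B (A x).
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))   # B initialized to zero: the adapter starts as a no-op
x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x, W, A, B), W @ x))  # True at initialization
```

Because only A and B are trained, the same frozen backbone can serve all three tasks (VQA, explanation, grounding) with a small parameter overhead.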

[129] BoRe-Depth: Self-supervised Monocular Depth Estimation with Boundary Refinement for Embedded Systems

Chang Liu, Juan Li, Sheng Zhang, Chang Liu, Jie Li, Xu Zhang

Main category: cs.CV

TL;DR: BoRe-Depth is a lightweight monocular depth estimation model with 8.7M parameters that achieves 50.7 FPS on embedded systems while improving boundary quality through enhanced feature fusion and semantic integration.

DetailsMotivation: Existing monocular depth estimation methods suffer from poor performance and blurred object boundaries on embedded systems, despite the low-cost advantage of monocular approaches for 3D perception in unmanned systems.

Method: Proposed BoRe-Depth model with Enhanced Feature Adaptive Fusion Module (EFAF) for adaptive depth feature fusion to enhance boundary details, and integration of semantic knowledge into the encoder to improve object recognition and boundary perception.

Result: Achieves 50.7 FPS on NVIDIA Jetson Orin, significantly outperforms previous lightweight models on multiple challenging datasets, and provides improved boundary quality in depth maps.

Conclusion: BoRe-Depth demonstrates efficient and accurate monocular depth estimation on embedded systems with enhanced boundary representation, making it suitable for real-time 3D perception applications in unmanned systems.

Abstract: Depth estimation is one of the key technologies for realizing 3D perception in unmanned systems. Monocular depth estimation has been widely researched because of its low-cost advantage, but the existing methods face the challenges of poor depth estimation performance and blurred object boundaries on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate depth maps on embedded systems and significantly improves boundary quality. Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which adaptively fuses depth features to enhance boundary detail representation. Secondly, we integrate semantic knowledge into the encoder to improve the object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets, and we provide detailed ablation studies for the proposed methods. The code is available at https://github.com/liangxiansheng093/BoRe-Depth.

[130] DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale

Ke Du, Yimin Peng, Chao Gao, Fan Zhou, Siqiao Xue

Main category: cs.CV

TL;DR: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across scales with YAML-driven workflows, 1000+ pretrained models, and reproducible recipes for classification, retrieval, and metric learning.

DetailsMotivation: To consolidate datasets, models, and training techniques into one platform for rapid experimentation in visual recognition and representation learning, bridging research and deployment.

Method: Uses a single YAML-driven workflow with timm-compatible interface for 1000+ pretrained backbones, modular losses, augmentations, distributed-training utilities, and one-command export to ONNX/HuggingFace.

Result: Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M and Stanford online products, providing a scalable foundation for visual recognition tasks.

Conclusion: DORAEMON offers a unified platform that enables efficient transfer of research advances to real-world applications through consolidated tools and reproducible workflows.

Abstract: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across diverse scales. A single YAML-driven workflow covers classification, retrieval and metric learning; more than 1000 pretrained backbones are exposed through a timm-compatible interface, together with modular losses, augmentations and distributed-training utilities. Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M and Stanford online products, while one-command export to ONNX or HuggingFace bridges research and deployment. By consolidating datasets, models, and training techniques into one platform, DORAEMON offers a scalable foundation for rapid experimentation in visual recognition and representation learning, enabling efficient transfer of research advances to real-world applications. The repository is available at https://github.com/wuji3/DORAEMON.

[131] HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats

Alan de Aguiar, Michaella Pereira Andrade, Charles Morphy D. Santos, João Paulo Gois

Main category: cs.CV

TL;DR: HideAndSeg is a minimally supervised AI tool that combines SAM2 and YOLOv11 to automatically segment octopuses in videos, using unsupervised metrics for evaluation and achieving reliable performance even after occlusions.

DetailsMotivation: Analyzing octopuses in natural habitats is difficult due to their camouflage, rapid skin changes, non-rigid deformations, occlusions, and challenging underwater conditions. There's also a lack of large-scale annotated datasets for this task.

Method: Integrates SAM2 with custom-trained YOLOv11 object detector. Starts with user-provided point coordinates for initial SAM2 segmentation, then uses these masks to train YOLO model. The pipeline becomes fully automated using bounding box prompts to SAM2, eliminating manual intervention. Introduces two unsupervised metrics: temporal consistency DICE_t and new component count NC_t.

Result: HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to manually prompted approach. Can re-identify and segment octopus even after complete occlusion in natural environments, where manually prompted model fails.

Conclusion: Provides a practical tool that reduces need for manual analysis in real-world scenarios, paving the way for more efficient behavioral studies of wild cephalopods.

Abstract: Analyzing octopuses in their natural habitats is challenging due to their camouflage capability, rapid changes in skin texture and color, non-rigid body deformations, and frequent occlusions, all of which are compounded by variable underwater lighting and turbidity. Addressing the lack of large-scale annotated datasets, this paper introduces HideAndSeg, a novel, minimally supervised AI-based tool for segmenting videos of octopuses. It establishes a quantitative baseline for this task. HideAndSeg integrates SAM2 with a custom-trained YOLOv11 object detector. First, the user provides point coordinates to generate the initial segmentation masks with SAM2. These masks serve as training data for the YOLO model. After that, our approach fully automates the pipeline by providing a bounding box prompt to SAM2, eliminating the need for further manual intervention. We introduce two unsupervised metrics - temporal consistency $DICE_t$ and new component count $NC_t$ - to quantitatively evaluate segmentation quality and guide mask refinement in the absence of ground-truth data, i.e., real-world information that serves to train, validate, and test AI models. Results show that HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to the manually prompted approach. Our method can re-identify and segment the octopus even after periods of complete occlusion in natural environments, a scenario in which the manually prompted model fails. By reducing the need for manual analysis in real-world scenarios, this work provides a practical tool that paves the way for more efficient behavioral studies of wild cephalopods.
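
The two unsupervised metrics can be sketched as follows; the exact formulations are assumptions here, reading DICE_t as the Dice overlap between consecutive frame masks and NC_t as the number of connected components in frame t (masks represented as sets of (row, col) pixels).

```python
from collections import deque

def dice_t(mask_prev, mask_curr):
    """Dice overlap between consecutive binary masks; high values suggest
    temporally consistent segmentation."""
    inter = len(mask_prev & mask_curr)
    total = len(mask_prev) + len(mask_curr)
    return 2.0 * inter / total if total else 1.0

def nc_t(mask):
    """Number of 4-connected components in a binary mask; spurious extra
    components hint at segmentation noise."""
    unseen = set(mask)
    count = 0
    while unseen:
        count += 1
        queue = deque([unseen.pop()])
        while queue:                      # BFS flood fill of one component
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if (r + dr, c + dc) in unseen:
                    unseen.remove((r + dr, c + dc))
                    queue.append((r + dr, c + dc))
    return count

m1 = {(0, 0), (0, 1), (1, 0)}
m2 = {(0, 0), (0, 1), (5, 5)}
print(round(dice_t(m1, m2), 3))  # 0.667
print(nc_t(m2))                  # 2
```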

[132] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

Main category: cs.CV

TL;DR: Introduces “Thinking with Video” paradigm using video generation models like Sora-2 to overcome limitations of text-only and image-only reasoning, achieving strong performance on both vision-centric and text-centric tasks.

DetailsMotivation: To address limitations of "Thinking with Text" and "Thinking with Images" paradigms: images capture only single moments and fail to represent dynamic processes, and the separation of text and vision hinders unified multimodal understanding.

Method: Developed VideoThinkBench with vision-centric tasks (Eyeballing Puzzles) and text-centric tasks (subsets of GSM8K, MMMU), evaluating Sora-2 video generation model as a reasoner using self-consistency and in-context learning techniques.

Result: Sora-2 performs comparably to SOTA VLMs on vision-centric tasks, surpasses VLMs on Eyeballing Games, achieves 92% accuracy on MATH and 75.53% on MMMU for text-centric tasks. Self-consistency and in-context learning improve performance.

Conclusion: Video generation models like Sora-2 serve as potential unified multimodal understanding and generation models, positioning “thinking with video” as a unified multimodal reasoning paradigm.

Abstract: The “Thinking with Text” and “Thinking with Images” paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce “Thinking with Video”, a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning “thinking with video” as a unified multimodal reasoning paradigm.
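
One of the techniques reported to improve Sora-2 is self-consistency. The generic recipe, independent of any particular model, is to sample several answers and keep the majority vote; `sample_answer` below is a hypothetical stand-in for a stochastic model call.

```python
import itertools
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=5):
    """Generic self-consistency: sample several answers to the same prompt
    and return the most frequent one."""
    votes = Counter(sample_answer(prompt) for _ in range(n_samples))
    answer, _ = votes.most_common(1)[0]
    return answer

# Toy stand-in model: answers "42" most of the time, occasionally drifts.
replies = itertools.cycle(["42", "42", "41", "42", "40"])
model = lambda prompt: next(replies)
print(self_consistency(model, "2 * 21 = ?", n_samples=5))  # 42
```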

[133] Solving Convex Partition Visual Jigsaw Puzzles

Yaniv Ohayon, Ofir Itzhak Shahar, Ohad Ben-Shahar

Main category: cs.CV

TL;DR: This paper presents a computational solver for convex partition jigsaw puzzles, expanding beyond traditional square puzzles by using geometric and pictorial compatibility measures.

DetailsMotivation: Most existing jigsaw puzzle solvers only handle square puzzles, which limits practical applications. This work aims to expand computational puzzle solving to handle convex partition puzzles, a major subset of polygonal puzzles.

Method: The authors utilize both geometrical and pictorial compatibilities between puzzle pieces and introduce a greedy solver approach to assemble the convex partition puzzles.

Result: The paper reports several performance measures and introduces the first benchmark dataset specifically for convex partition puzzles.

Conclusion: This work significantly expands the types of puzzles that can be handled computationally, moving beyond the limitations of square jigsaw puzzles to address more practical convex partition puzzles.

Abstract: Jigsaw puzzle solving requires the rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole, often an image, and is known to be an intractable problem. While the possible impact of automatic puzzle solvers can be disruptive in various application domains, most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions, a major subset of polygonal puzzles whose pieces are convex. We utilize both geometrical and pictorial compatibilities, introduce a greedy solver, and report several performance measures next to the first benchmark dataset of such puzzles.
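
The paper's greedy solver operates on pairwise geometric and pictorial compatibilities. As a hedged sketch of the general pattern only (the actual scoring and placement logic is the paper's), the snippet below repeatedly commits the highest-compatibility pairing among unplaced pieces; `compatibility(a, b)` is a placeholder for the combined score.

```python
import heapq

def greedy_assemble(pieces, compatibility):
    """Generic greedy matching: repeatedly commit the highest-compatibility
    pairing among pieces not yet placed. Higher scores are better."""
    # Max-heap of all candidate pairs via negated scores.
    heap = [(-compatibility(a, b), i, j)
            for i, a in enumerate(pieces)
            for j, b in enumerate(pieces) if i < j]
    heapq.heapify(heap)
    placed, order = set(), []
    while heap:
        neg_score, i, j = heapq.heappop(heap)
        if i in placed or j in placed:   # skip pairs touching placed pieces
            continue
        order.append((i, j, -neg_score))
        placed.update((i, j))
    return order

# Toy example: 1-D "pieces" whose compatibility is closeness of their values.
pieces = [0.0, 0.9, 5.0, 5.2]
matches = greedy_assemble(pieces, lambda a, b: -abs(a - b))
print(matches[0][:2])  # (2, 3): the closest pair is committed first
```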

[134] V-Thinker: Interactive Thinking with Images

Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang

Main category: cs.CV

TL;DR: V-Thinker is a multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning, outperforming existing LMM-based methods in both general and interactive reasoning tasks.

DetailsMotivation: Current LMMs have limited visual tool spaces and task-specific workflow designs, restricting their ability to deeply integrate image interaction with long-horizon reasoning capabilities.

Method: V-Thinker uses a Data Evolution Flywheel to synthesize and verify interactive reasoning datasets, and a Visual Progressive Training Curriculum with two-stage reinforcement learning that first aligns perception via point-level supervision then integrates interactive reasoning.

Result: Extensive experiments show V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios.

Conclusion: V-Thinker provides valuable insights for advancing image-interactive reasoning applications and represents a significant step in enabling interactive, vision-centric thinking in multimodal models.

Abstract: Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions-diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.

[135] Landslide Hazard Mapping with Geospatial Foundation Models: Geographical Generalizability, Data Scarcity, and Band Adaptability

Wenwen Li, Sizhe Wang, Hyunho Lee, Chenyan Lu, Sujit Roy, Rahul Ramachandran, Chia-Yu Hsu

Main category: cs.CV

TL;DR: GeoFMs, particularly Prithvi-EO-2.0, outperform traditional models for landslide mapping across sensors, regions, and limited data scenarios through a three-axis framework of sensor, label, and domain adaptation.

DetailsMotivation: Landslides cause severe damage, but conventional deep learning models struggle with cross-sensor, cross-region applications and limited training data, requiring more robust solutions.

Method: Three-axis analytical framework (sensor, label, domain) for adapting geospatial foundation models, focusing on Prithvi-EO-2.0 with global pretraining, self-supervision, and adaptable fine-tuning.

Result: Consistently outperforms task-specific CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other GeoFMs (TerraMind, SatMAE); resilient to spectral variation, maintains accuracy under label scarcity, and generalizes well across diverse datasets.

Conclusion: GeoFMs represent a step toward more robust and scalable approaches for landslide risk reduction, though challenges remain in computational cost and limited reusable AI-ready training data.

Abstract: Landslides cause severe damage to lives, infrastructure, and the environment, making accurate and timely mapping essential for disaster preparedness and response. However, conventional deep learning models often struggle when applied across different sensors, regions, or under conditions of limited training data. To address these challenges, we present a three-axis analytical framework of sensor, label, and domain for adapting geospatial foundation models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a series of experiments, we show that it consistently outperforms task-specific CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other GeoFMs (TerraMind, SatMAE). The model, built on global pretraining, self-supervision, and adaptable fine-tuning, proved resilient to spectral variation, maintained accuracy under label scarcity, and generalized more reliably across diverse datasets and geographic settings. Alongside these strengths, we also highlight remaining challenges such as computational cost and the limited availability of reusable AI-ready training data for landslide research. Overall, our study positions GeoFMs as a step toward more robust and scalable approaches for landslide risk reduction and environmental monitoring.

[136] THEval. Evaluation Framework for Talking Head Video Generation

Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva

Main category: cs.CV

TL;DR: Proposes a comprehensive evaluation framework with 8 metrics across quality, naturalness, and synchronization dimensions to address the gap in assessing talking head video generation.

DetailsMotivation: Current evaluation metrics for talking head generation are limited and lag behind rapid advances in video generation technology, necessitating more comprehensive assessment methods.

Method: Developed an evaluation framework with 8 metrics focusing on fine-grained dynamics of head, mouth, eyebrows, and face quality, emphasizing efficiency and human preference alignment.

Result: Extensive experiments on 85,000 videos from 17 state-of-the-art models revealed that while algorithms excel in lip synchronization, they struggle with generating expressiveness and artifact-free details.

Conclusion: The proposed benchmark framework provides comprehensive evaluation for generative methods and will be publicly released with regular updates to track field progress.

Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we analyze fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated from a novel real dataset that we curated in order to mitigate training-data bias. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.

[137] Learning from Single Timestamps: Complexity Estimation in Laparoscopic Cholecystectomy

Dimitrios Anastasiou, Santiago Barbarisi, Lucy Culshaw, Jayna Patel, Evangelos B. Mazomenos, Imanol Luengo, Danail Stoyanov

Main category: cs.CV

TL;DR: STC-Net is a novel framework for automated surgical complexity assessment in Laparoscopic Cholecystectomy using the Parkland Grading Scale, operating on full videos with weak temporal supervision.

DetailsMotivation: Accurate assessment of surgical complexity is essential in LC, where severe inflammation affects operative times and complication risks. Current methods require manual curation, limiting scalability.

Method: STC-Net performs joint temporal localization and grading through localization, window proposal, and grading modules with novel loss combining hard/soft localization and background-aware supervision.

Result: On 1,859 LC videos, STC-Net achieved 62.11% accuracy and 61.42% F1-score, outperforming non-localized baselines by over 10% in both metrics.

Conclusion: STC-Net provides a scalable approach for automated PGS-based complexity estimation from full surgical videos, promising for post-operative analysis and surgical training.

Abstract: Purpose: Accurate assessment of surgical complexity is essential in Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with longer operative times and increased risk of postoperative complications. The Parkland Grading Scale (PGS) provides a clinically validated framework for stratifying inflammation severity; however, its automation in surgical videos remains largely unexplored, particularly in realistic scenarios where complete videos must be analyzed without prior manual curation. Methods: In this work, we introduce STC-Net, a novel framework for SingleTimestamp-based Complexity estimation in LC via the PGS, designed to operate under weak temporal supervision. Unlike prior methods limited to static images or manually trimmed clips, STC-Net operates directly on full videos. It jointly performs temporal localization and grading through a localization, window proposal, and grading module. We introduce a novel loss formulation combining hard and soft localization objectives and background-aware grading supervision. Results: Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by over 10% in both metrics and highlighting the effectiveness of weak supervision for surgical complexity assessment. Conclusion: STC-Net demonstrates a scalable and effective approach for automated PGS-based surgical complexity estimation from full LC videos, making it promising for post-operative analysis and surgical training.

[138] UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction

Chen Shi, Shaoshuai Shi, Xiaoyang Lyu, Chunyang Liu, Kehua Sheng, Bo Zhang, Li Jiang

Main category: cs.CV

TL;DR: UniSplat is a feed-forward framework for dynamic 3D scene reconstruction in autonomous driving that uses unified latent spatio-temporal fusion to handle sparse, non-overlapping camera views and complex dynamics.

DetailsMotivation: Existing methods struggle with sparse, non-overlapping camera views and complex scene dynamics in autonomous driving scenarios, requiring a more robust reconstruction approach.

Method: Constructs a 3D latent scaffold using pretrained foundation models, employs efficient spatio-temporal fusion within the scaffold, and uses a dual-branch decoder combining point-anchored refinement with voxel-based generation to create dynamic-aware Gaussians.

Result: Achieves state-of-the-art performance in novel view synthesis and provides robust, high-quality renderings even for viewpoints outside original camera coverage on real-world datasets.

Conclusion: UniSplat effectively addresses the challenges of dynamic scene reconstruction in autonomous driving through unified spatio-temporal fusion and persistent memory, enabling complete reconstructions beyond current camera coverage.

Abstract: Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.

[139] Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality

Tushar Kataria, Shikha Dubey, Mary Bronner, Jolanta Jedrzkiewicz, Ben J. Brintz, Shireen Y. Elhabian, Beatrice S. Knudsen

Main category: cs.CV

TL;DR: The paper introduces an automated framework to evaluate virtual IHC stain quality using stain accuracy metrics rather than conventional image fidelity metrics, showing that paired models perform best and WSI-level evaluation is crucial.

DetailsMotivation: Current evaluation methods for virtual IHC stains focus on image fidelity rather than staining accuracy, making it difficult to assess whether models correctly identify IHC-positive pixels for clinical use.

Method: Used color deconvolution to generate masks of IHC-positive pixels from real and virtual IHC images, then computed stain accuracy metrics (Dice, IoU, Hausdorff distance) to quantify pixel-level labeling accuracy without manual annotations.

Result: Conventional metrics (FID, PSNR, SSIM) correlate poorly with stain accuracy and pathologist assessment. Paired models (PyramidPix2Pix, AdaptiveNCE) achieved highest stain accuracy, while unpaired diffusion/GAN models were less reliable. WSI evaluations revealed performance issues not visible in patch-based analysis.

Conclusion: The framework provides a reproducible approach for assessing virtual IHC model quality, which is essential for accelerating clinical translation and routine use by pathologists.

Abstract: Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated and accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Frechet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC-positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks. Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.
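The mask-overlap portion of the method can be sketched in plain NumPy. This computes only the Dice and IoU scores on precomputed binary IHC-positive masks; the color-deconvolution step and the Hausdorff distance are omitted, and `stain_accuracy` is an illustrative name, not the paper's API.

```python
import numpy as np

def stain_accuracy(real_mask: np.ndarray, virtual_mask: np.ndarray) -> dict:
    """Compare binary IHC-positive masks from real vs. virtual stains.

    Dice and IoU directly quantify pixel-level labeling agreement,
    with no manual annotations required.
    """
    real = real_mask.astype(bool)
    virt = virtual_mask.astype(bool)
    inter = np.logical_and(real, virt).sum()
    union = np.logical_or(real, virt).sum()
    total = real.sum() + virt.sum()
    dice = 2.0 * inter / total if total else 1.0  # empty masks agree trivially
    iou = inter / union if union else 1.0
    return {"dice": dice, "iou": iou}
```

Running this per patch and again per whole-slide image is what exposes the WSI-level performance drops the paper describes.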

[140] NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment

Kylie Cancilla, Alexander Moore, Amar Saini, Carmen Carrano

Main category: cs.CV

TL;DR: A streaming-based no-reference video quality assessment model that uses temporal modeling and synthetic degradations to predict full-reference metrics without requiring clean references or human labels.

DetailsMotivation: Existing VQA methods have limitations: full-reference metrics need clean reference videos, while most no-reference models rely on expensive human opinion labels. Image-based opinion-unaware methods ignore temporal context crucial for video analysis.

Method: Leverages synthetic degradations of DAVIS dataset to train a temporal-aware convolutional architecture that predicts FR metrics (LPIPS, PSNR, SSIM) directly from degraded video without references during inference.

Result: Outperforms image-based baseline by generalizing across diverse degradations, and achieves higher correlation with full-reference metrics compared to BRISQUE, demonstrating the value of temporal modeling.

Conclusion: The streaming approach provides scalable, opinion-unaware VQA that effectively captures temporal context for real-world vision systems, offering a practical solution without requiring clean references or human labels.

Abstract: Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely-used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.
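A minimal sketch of how a full-reference training target such as PSNR is computed from a clean/degraded frame pair during dataset construction; at inference the trained network regresses such values from the degraded video alone. The function name and the 8-bit intensity range are assumptions, not the paper's code.

```python
import numpy as np

def psnr(reference: np.ndarray, degraded: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a clean frame and its degraded copy.

    Used here only as a label generator: (degraded frame, psnr value) pairs
    supervise a no-reference model that never sees the reference at test time.
    """
    mse = np.mean((reference.astype(np.float64) - degraded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val**2 / mse)
```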

[141] Polarization-resolved imaging improves eye tracking

Mantas Žurauskas, Tom Bu, Sanaz Alali, Beyza Kalkanli, Derek Shi, Fernando Alamos, Gauresh Pandit, Christopher Mei, Ali Behrooz, Ramin Mirjalili, Dave Stronks, Alexander Fix, Dmitri Model

Main category: cs.CV

TL;DR: Polarization-enabled eye tracking (PET) uses polarization-resolved near-infrared imaging to improve eye tracking accuracy by revealing trackable features on sclera and cornea that are invisible in intensity-only images.

DetailsMotivation: To enhance eye tracking by adding polarization state measurement as an additional optical contrast mechanism beyond just light intensity, enabling better tracking in challenging conditions.

Method: A PET system composed of a polarization-filter-array camera with linearly polarized near-infrared illuminator, using convolutional neural network models trained on data from 346 participants.

Result: PET reduced median 95th-percentile absolute gaze error by 10-16% compared to intensity-only baselines, showing robustness against eyelid occlusions, eye-relief changes, and pupil-size variation.

Conclusion: PET provides practical gains in human-computer interaction and represents a simple, robust sensing modality for future wearable devices by leveraging light-tissue polarization effects.

Abstract: Polarization-resolved near-infrared imaging adds a useful optical contrast mechanism to eye tracking by measuring the polarization state of light reflected by ocular tissues in addition to its intensity. In this paper we demonstrate how this contrast can be used to enable eye tracking. Specifically, we demonstrate that a polarization-enabled eye tracking (PET) system composed of a polarization-filter-array camera paired with a linearly polarized near-infrared illuminator can reveal trackable features across the sclera and gaze-informative patterns on the cornea, largely absent in intensity-only images. Across a cohort of 346 participants, convolutional neural network based machine learning models trained on data from PET reduced the median 95th-percentile absolute gaze error by 10-16% relative to capacity-matched intensity baselines under nominal conditions and in the presence of eyelid occlusions, eye-relief changes, and pupil-size variation. These results link light-tissue polarization effects to practical gains in human-computer interaction and position PET as a simple, robust sensing modality for future wearable devices.
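The extra contrast PET measures can be made concrete with the standard linear-Stokes reconstruction from the four orientations of a polarization-filter-array sensor. This is textbook polarimetry under an assumed 0/45/90/135 degree filter layout, not the paper's CNN pipeline, and the function is a hypothetical sketch.

```python
import numpy as np

def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes parameters from the four polarizer orientations
    of a polarization-filter-array sensor.

    Returns total intensity (S0), degree of linear polarization (DoLP),
    and angle of linear polarization (AoLP); DoLP/AoLP carry the
    polarization contrast that intensity-only imaging discards.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                        # horizontal vs. vertical
    s2 = i45 - i135                      # diagonal components
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-12)
    aolp = 0.5 * np.arctan2(s2, s1)
    return s0, dolp, aolp
```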

[142] Benchmark Designers Should “Train on the Test Set” to Expose Exploitable Non-Visual Shortcuts

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie

Main category: cs.CV

TL;DR: The paper proposes a framework to diagnose and debias multimodal benchmarks by identifying non-visual biases that allow models to perform well without strong visual understanding.

DetailsMotivation: Current multimodal benchmarks can be gamed using biases and linguistic patterns rather than genuine visual understanding, undermining their reliability for evaluating MLLMs.

Method: Two-component framework: 1) Test-set Stress-Test (TsT) using LLM fine-tuning on textual inputs to reveal shortcuts, 2) Iterative Bias Pruning (IBP) to filter high-bias samples from benchmarks.

Result: Applied to four benchmarks (VSI-Bench, CV-Bench, MMMU, VideoMME), revealing pervasive non-visual biases. Created VSI-Bench-Debiased with reduced non-visual solvability and wider vision-blind performance gap.

Conclusion: Benchmark designers should proactively identify and mitigate non-visual biases using diagnostic procedures to create more robust evaluations of multimodal models.

Abstract: Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" – probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via k-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks – VSI-Bench, CV-Bench, MMMU, and VideoMME – we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
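The TsT idea can be illustrated in miniature: hold out each fold of the test set, fit a text-only predictor on the remaining folds, and mark held-out samples that are answerable without the image. The paper fine-tunes an LLM; here a majority-answer-per-question-type prior stands in as the text-only predictor, purely to show the mechanics of the bias score.

```python
from collections import Counter, defaultdict

def kfold_bias_scores(samples, k=5):
    """Toy Test-set Stress-Test: score each sample 1.0 if a text-only
    predictor trained on the other folds answers it correctly.

    Each sample is a dict with "qtype" (textual question category) and
    "answer"; the predictor is simply the majority answer per qtype.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        # Fit the text-only prior on all folds except this one.
        counts = defaultdict(Counter)
        for i, s in enumerate(samples):
            if i % k != fold:
                counts[s["qtype"]][s["answer"]] += 1
        # Score the held-out fold: solvable without the image -> biased.
        for i, s in enumerate(samples):
            if i % k == fold and counts[s["qtype"]]:
                guess = counts[s["qtype"]].most_common(1)[0][0]
                scores[i] = 1.0 if guess == s["answer"] else 0.0
    return scores
```

High-scoring samples are the ones an Iterative-Bias-Pruning-style filter would remove.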

[143] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie

Main category: cs.CV

TL;DR: SIMS-V is a framework that uses 3D simulators to generate spatially-rich video training data for multimodal language models, enabling efficient training with minimal question types that outperform larger models on real-world spatial reasoning benchmarks.

DetailsMotivation: Multimodal language models struggle with spatial reasoning across time and space, and obtaining diverse real-world video data with precise spatial annotations is challenging.

Method: Systematic data-generation framework leveraging 3D simulators’ privileged information to create spatially-rich video training data, with systematic ablations of question types, mixes, and scales.

Result: Identified three key question categories (metric measurement, perspective-dependent reasoning, temporal tracking) that enable effective transfer. A 7B-parameter model trained on 25K simulated examples outperforms 72B baseline and achieves competitive performance with proprietary models on real-world spatial reasoning benchmarks.

Conclusion: Simulated data can effectively develop transferable spatial intelligence, enabling efficient training that maintains general video understanding while substantially improving embodied and real-world spatial tasks.

Abstract: Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V – a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

[144] Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie

Main category: cs.CV

TL;DR: The paper proposes a shift from reactive multimodal systems to ‘supersensing’ - a broader paradigm with four stages of spatial cognition. It introduces VSI-SUPER benchmark and shows that scale alone is insufficient, proposing predictive sensing as a solution.

DetailsMotivation: Current multimodal systems are reactive and task-driven, lacking true spatial cognition and world modeling capabilities. Progress requires moving beyond linguistic-only understanding to broader spatial supersensing.

Method: Introduces VSI-SUPER benchmark with VSR (visual spatial recall) and VSC (visual spatial counting) tasks. Tests data scaling limits with VSI-590K dataset and Cambrian-S model. Proposes predictive sensing using self-supervised next-latent-frame prediction with surprise-driven memory.

Result: Achieved +30% improvement on VSI-Bench without sacrificing general capabilities, but performance on VSI-SUPER remains limited. Predictive sensing approach substantially outperforms proprietary baselines on VSI-SUPER.

Conclusion: Scale alone is insufficient for spatial supersensing. True progress requires models that anticipate, select, and organize experience through predictive sensing, not just see and react.

Abstract: We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
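The surprise-driven proof of concept reduces to thresholding prediction error. In this sketch the learned next-latent-frame predictor is replaced by a trivial last-frame-persists model, so only the mechanism (surprise above a threshold starts a new event and refreshes memory) is illustrated, not the paper's self-supervised predictor.

```python
import numpy as np

def segment_by_surprise(latents, threshold):
    """Mark event boundaries where prediction error ("surprise") spikes.

    Stand-in predictor: the next latent equals the current one, so
    surprise is just the distance between consecutive latent frames.
    """
    boundaries = []
    for t in range(1, len(latents)):
        surprise = float(np.linalg.norm(latents[t] - latents[t - 1]))
        if surprise > threshold:
            boundaries.append(t)  # new event begins at frame t
    return boundaries
```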

[145] InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, Zehuan Yuan

Main category: cs.CV

TL;DR: InfinityStar is a unified spacetime autoregressive framework for high-resolution image and video synthesis that outperforms existing autoregressive models and competes with diffusion methods while being 10x faster.

DetailsMotivation: To create a unified framework that captures both spatial and temporal dependencies for various generation tasks (text-to-image, text-to-video, etc.) using autoregressive modeling, building on recent successes in vision and language.

Method: A purely discrete autoregressive approach that jointly models spatial and temporal dependencies within a single architecture, supporting multiple generation tasks through straightforward temporal autoregression.

Result: Achieves 83.74 on VBench, outperforming all autoregressive models and even some diffusion competitors like HunyuanVideo. Generates 5s 720p videos 10x faster than leading diffusion methods, making it the first autoregressive video generator capable of industrial-level 720p video production.

Conclusion: InfinityStar demonstrates that unified spacetime autoregressive modeling can achieve state-of-the-art performance in high-resolution video generation while being significantly more efficient than diffusion-based approaches, opening new possibilities for efficient, high-quality video synthesis.

Abstract: We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

[146] Tracking and Understanding Object Transformations

Yihong Sun, Xinyu Yang, Jennifer J. Sun, Bharath Hariharan

Main category: cs.CV

TL;DR: Introduces Track Any State task for tracking objects through transformations with state change detection, presents TubeletGraph zero-shot system and VOST-TAS benchmark dataset.

DetailsMotivation: Existing tracking methods fail when objects undergo significant appearance changes during transformations (e.g., apple being cut, butterfly emerging from cocoon), losing track of targets.

Method: TubeletGraph system identifies overlooked tracks, integrates them using semantic and proximity priors, then generates state graphs describing object transformations over time.

Result: Achieves state-of-the-art tracking performance under transformations while demonstrating deep understanding of object transformations and capabilities in temporal grounding and semantic reasoning.

Conclusion: TubeletGraph successfully addresses the limitation of tracking objects through state transformations and provides comprehensive understanding of object dynamics.

Abstract: Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.

Rafe Loya, Andrew Hamara, Benjamin Estell, Benjamin Kilpatrick, Andrew C. Freeman

Main category: cs.CV

TL;DR: This paper addresses the problem of generating multiple distinct aesthetic crops from images, motivated by social media applications, and introduces a new dataset with human labels.

DetailsMotivation: Modern social media applications require multiple distinct aesthetic crops from single images, but existing methods focus only on producing singular crops.

Method: Evaluated several single-crop models combined with an image partitioning algorithm as a pre-processing step to generate multiple crops.

Result: Introduced a dataset of 277 images with human labels for evaluating multiple crop generation methods.

Conclusion: The paper establishes the problem of multiple aesthetic crop generation and provides a dataset for future research in this area.

Abstract: Automatic image cropping is a method for maximizing the human-perceived quality of cropped regions in photographs. Although several works have proposed techniques for producing singular crops, little work has addressed the problem of producing multiple, distinct crops with aesthetic appeal. In this paper, we motivate the problem with a discussion on modern social media applications, introduce a dataset of 277 relevant images and human labels, and evaluate the efficacy of several single-crop models with an image partitioning algorithm as a pre-processing step. The dataset is available at https://github.com/RafeLoya/carousel.

[148] Practical solutions to the relative pose of three calibrated cameras

Charalambos Tzamos, Viktor Kocur, Yaqing Ding, Daniel Barath, Zuzana Berger Haladova, Torsten Sattler, Zuzana Kukelova

Main category: cs.CV

TL;DR: Novel efficient solvers for estimating relative pose of three calibrated cameras from four point correspondences using approximate geometry estimation.

DetailsMotivation: To solve the challenging problem of estimating relative pose from minimal correspondences (4 points) for three calibrated cameras, which is computationally difficult.

Method: Use four correspondences to estimate approximate geometry of first two views, modeling as affine or perspective geometry using additional approximate correspondence (mean point of three input points). Leverages existing minimal solvers like 4-point affine fundamental matrix, 5-point relative pose solver, and P3P solver.

Result: Proposed solvers achieve state-of-the-art results when coupled with local optimization, with the mean-point correspondence solver being more robust and accurate than affine-based solver.

Conclusion: The novel solvers provide efficient and easy-to-implement solutions for three-camera relative pose estimation from minimal correspondences, with the mean-point approach showing superior performance.

Abstract: We study the challenging problem of estimating the relative pose of three calibrated cameras from four point correspondences. We propose novel efficient solutions to this problem that are based on the simple idea of using four correspondences to estimate an approximate geometry of the first two views. We model this geometry either as an affine or a fully perspective geometry estimated using one additional approximate correspondence. We generate such an approximate correspondence using a very simple and efficient strategy, where the new point is the mean point of three corresponding input points. The new solvers are efficient and easy to implement, since they are based on existing efficient minimal solvers, i.e., the 4-point affine fundamental matrix, the well-known 5-point relative pose solver, and the P3P solver. Extensive experiments on real data show that the proposed solvers, when properly coupled with local optimization, achieve state-of-the-art results, with the novel solver based on approximate mean-point correspondences being more robust and accurate than the affine-based solver.
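The mean-point trick from the abstract is simple enough to state directly: given three of the input correspondences, the synthetic approximate correspondence is the centroid in each view. The downstream minimal solvers (4-point affine fundamental matrix, 5-point relative pose, P3P) are not shown; the function name is illustrative.

```python
import numpy as np

def mean_point_correspondence(pts1: np.ndarray, pts2: np.ndarray):
    """Synthesize an approximate extra correspondence as the mean point.

    pts1, pts2: (3, 2) arrays of three corresponding image points in
    views one and two; returns the centroid in each view, which serves
    as the cheap fifth point for the perspective two-view solver.
    """
    return pts1.mean(axis=0), pts2.mean(axis=0)
```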

[149] Bridging Generative and Discriminative Noisy-Label Learning via Direction-Agnostic EM Formulation

Fengbei Liu, Chong Wang, Yuanhong Chen, Yuyuan Liu, Gustavo Carneiro

Main category: cs.CV

TL;DR: A generative noisy-label learning framework that is direction-agnostic, avoids explicit image synthesis, and uses instance-specific label priors to achieve state-of-the-art performance with lower computational cost.

DetailsMotivation: Existing generative methods for noisy-label learning introduce extra complexity, fix data-generating directions, and assume uniform label priors, limiting their effectiveness and adaptability.

Method: Proposes a single-stage EM framework with direction-agnostic optimization, replaces intractable generative terms with discriminative proxies, and introduces Partial-Label Supervision for instance-specific label priors.

Result: Achieves state-of-the-art accuracy, lower transition-matrix estimation error, and substantially reduced training compute on vision and NLP noisy-label benchmarks.

Conclusion: The framework successfully combines generative modeling benefits with discriminative efficiency, offering a principled yet practical solution for noisy-label learning.

Abstract: Although noisy-label learning is often approached with discriminative methods for simplicity and speed, generative modeling offers a principled alternative by capturing the joint mechanism that produces features, clean labels, and corrupted observations. However, prior work typically (i) introduces extra latent variables and heavy image generators that bias training toward reconstruction, (ii) fixes a single data-generating direction (Y→X or X→Y), limiting adaptability, and (iii) assumes a uniform prior over clean labels, ignoring instance-level uncertainty. We propose a single-stage, EM-style framework for generative noisy-label learning that is direction-agnostic and avoids explicit image synthesis. First, we derive a single Expectation-Maximization (EM) objective whose E-step specializes to either causal orientation without changing the overall optimization. Second, we replace the intractable p(X|Y) with a dataset-normalized discriminative proxy computed using a discriminative classifier on the finite training set, retaining the structural benefits of generative modeling at much lower cost. Third, we introduce Partial-Label Supervision (PLS), an instance-specific prior over clean labels that balances coverage and uncertainty, improving data-dependent regularization. Across standard vision and natural language processing (NLP) noisy-label benchmarks, our method achieves state-of-the-art accuracy, lower transition-matrix estimation error, and substantially less training compute than current generative and discriminative baselines. Code: https://github.com/lfb-1/GNL
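A generic EM E-step for label noise conveys the flavor of the approach (this is not the paper's exact direction-agnostic objective): the posterior over clean labels combines a discriminative proxy for the class-conditional term, a label-transition matrix, and an instance-specific prior in the spirit of Partial-Label Supervision. All names here are illustrative.

```python
import numpy as np

def e_step_clean_posterior(class_probs, transition, noisy_label, prior=None):
    """One E-step: posterior over clean labels given a noisy observation.

    class_probs: discriminative proxy scores per class (stands in for the
                 intractable class-conditional term p(X|Y)).
    transition:  T[clean, observed] label-noise matrix.
    noisy_label: index of the observed (possibly corrupted) label.
    prior:       instance-specific prior over clean labels; uniform if None.
    """
    prior = np.ones_like(class_probs) if prior is None else prior
    unnorm = class_probs * transition[:, noisy_label] * prior
    return unnorm / unnorm.sum()  # q(y), normalized posterior
```

The M-step (not shown) would re-fit the classifier and transition matrix against these soft posteriors.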

[150] Robust Self-calibration of Focal Lengths from the Fundamental Matrix

Viktor Kocur, Daniel Kyselica, Zuzana Kukelova

Main category: cs.CV

TL;DR: Proposes an iterative method for robust self-calibration of two cameras from fundamental matrix, improving accuracy over Bougnoux formula by estimating focal lengths and principal points with priors.

DetailsMotivation: The Bougnoux formula for camera self-calibration has limitations: it yields inaccurate results due to singularities, is sensitive to noise in fundamental matrix, and depends on assumed principal point positions.

Method: Developed an efficient iterative method to estimate focal lengths and principal points using priors, plus a computationally efficient model check for RANSAC that improves accuracy while reducing computation time.

Result: Extensive experiments on real and synthetic data show significant improvements in focal length estimation accuracy over Bougnoux formula and other state-of-the-art methods, even with inaccurate priors.

Conclusion: The proposed iterative method provides robust and accurate camera self-calibration, overcoming limitations of traditional approaches and performing well even with imperfect prior information.

Abstract: The problem of self-calibration of two cameras from a given fundamental matrix is one of the basic problems in geometric computer vision. Under the assumption of known principal points and square pixels, the well-known Bougnoux formula offers a means to compute the two unknown focal lengths. However, in many practical situations, the formula yields inaccurate results due to commonly occurring singularities. Moreover, the estimates are sensitive to noise in the computed fundamental matrix and to the assumed positions of the principal points. In this paper, we therefore propose an efficient and robust iterative method to estimate the focal lengths along with the principal points of the cameras given a fundamental matrix and priors for the estimated camera parameters. In addition, we study a computationally efficient check of models generated within RANSAC that improves the accuracy of the estimated models while reducing the total computational time. Extensive experiments on real and synthetic data show that our iterative method brings significant improvements in terms of the accuracy of the estimated focal lengths over the Bougnoux formula and other state-of-the-art methods, even when relying on inaccurate priors.

[151] LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry

Weirong Chen, Le Chen, Rui Wang, Marc Pollefeys

Main category: cs.CV

TL;DR: LEAP-VO is a robust visual odometry system that uses long-term point tracking to handle occlusions and dynamic scenes, outperforming existing methods across benchmarks.

DetailsMotivation: Existing visual odometry methods focus on two-view tracking and ignore temporal context, leading to poor performance in occlusions, dynamic objects, and low-texture areas.

Method: Proposes LEAP module combining visual, inter-track, and temporal cues with anchors for dynamic track estimation, plus temporal probabilistic formulation with iterative refinement for uncertainty reasoning.

Result: LEAP-VO significantly outperforms existing baselines across various visual odometry benchmarks.

Conclusion: Long-term point tracking as front-end with probabilistic uncertainty modeling enables robust visual odometry in challenging scenarios.

Abstract: Visual odometry estimates the motion of a moving camera based on visual input. Existing methods, mostly focusing on two-view point tracking, often ignore the rich temporal context in the image sequence, thereby overlooking the global motion patterns and providing no assessment of the full trajectory reliability. These shortcomings hinder performance in scenarios with occlusion, dynamic objects, and low-texture areas. To address these challenges, we present the Long-term Effective Any Point Tracking (LEAP) module. LEAP innovatively combines visual, inter-track, and temporal cues with mindfully selected anchors for dynamic track estimation. Moreover, LEAP’s temporal probabilistic formulation integrates distribution updates into a learnable iterative refinement module to reason about point-wise uncertainty. Based on these traits, we develop LEAP-VO, a robust visual odometry system adept at handling occlusions and dynamic scenes. Our mindful integration showcases a novel practice by employing long-term point tracking as the front-end. Extensive experiments demonstrate that the proposed pipeline significantly outperforms existing baselines across various visual odometry benchmarks.

[152] Revealing the structure-property relationships of copper alloys with FAGC

Yuexing Han, Ruijie Li, Guanxin Wan, Gan Hu, Yi Liu, Bing Wang

Main category: cs.CV

TL;DR: The paper presents a feature augmentation method (FAGC) to predict electrical conductivity and hardness of Cu-Cr-Zr alloys from microstructural images, achieving high accuracy with limited training data.

DetailsMotivation: Cu-Cr-Zr alloys are important for electronics and power industries, but limited sample availability has hindered studies linking microstructure to key properties like electrical conductivity and hardness.

Method: Used FAGC feature augmentation in pre-shape space to enhance microstructural images, constructed pseudo-labels to expand training samples, and applied various machine learning models for performance prediction.

Result: Achieved superior performance with decision tree classifier using 100 augmented samples (R²=0.978 for electrical conductivity, R²=0.998 for hardness). Found that regions with reduced image noise contribute more to electrical conductivity.

Conclusion: FAGC method effectively overcomes limited image data challenges in materials science, providing a powerful tool for establishing quantitative microstructure-property relationships.

Abstract: Cu-Cr-Zr alloys play a crucial role in electronic devices and the electric power industry, where their electrical conductivity and hardness are of great importance. However, due to the scarcity of available samples, there has been a lack of effective studies exploring the relationship between the microstructural images of Cu-Cr-Zr alloys and their key properties. In this paper, the FAGC feature augmentation method is employed to enhance the microstructural images of Cu-Cr-Zr alloys within a feature space known as the pre-shape space. Pseudo-labels are then constructed to expand the number of training samples. These features are then input into various machine learning models to construct performance prediction models for the alloy. Finally, we validate the impact of different machine learning methods and the number of augmented features on prediction accuracy through experiments. Experimental results demonstrate that our method achieves superior performance in predicting electrical conductivity (R²=0.978) and hardness (R²=0.998) when using the decision tree classifier with 100 augmented samples. Further analysis reveals that regions with reduced image noise, such as fewer grain or phase boundaries, exhibit higher contributions to electrical conductivity. These findings highlight the potential of the FAGC method in overcoming the challenges of limited image data in materials science, offering a powerful tool for establishing detailed and quantitative relationships between complex microstructures and material properties.

[153] EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs

Zhen Fan, Peng Dai, Zhuo Su, Xu Gao, Zheng Lv, Jiarui Zhang, Tianyuan Du, Guidong Wang, Yang Zhang

Main category: cs.CV

TL;DR: EMHI is a multimodal egocentric human motion dataset with HMD and IMUs, collected under real VR conditions, containing synchronized stereo images and IMU data with SMPL pose annotations. The dataset enables better egocentric human pose estimation through multimodal fusion.

DetailsMotivation: Current egocentric HPE methods suffer from self-occlusion in images or sparseness/drift in IMU data, and the field lacks real-world multimodal datasets collected under actual VR product conditions.

Method: Created EMHI dataset with 885 sequences from 58 subjects performing 39 actions, providing synchronized stereo images from HMD cameras and body-worn IMU data with SMPL annotations. Also introduced MEPoser baseline method with multimodal fusion encoder, temporal feature encoder, and MLP regression heads.

Result: EMHI contains 28.5 hours of recording with validated annotations. MEPoser outperforms single-modal methods on EMHI, demonstrating the dataset’s value for solving egocentric HPE problems.

Conclusion: EMHI dataset and MEPoser method advance egocentric HPE research and accelerate practical implementation in VR/AR products by providing real-world multimodal data and effective fusion approach.

Abstract: Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome the barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected under the real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance the research of egocentric HPE and expedite the practical implementation of this technology in VR/AR products.

[154] Pseudo-Stereo Inputs: A Solution to the Occlusion Challenge in Self-Supervised Stereo Matching

Ruizhi Yang, Xingqiang Li, Jiajun Bai, Jinsong Du

Main category: cs.CV

TL;DR: Proposes a probabilistic framework for self-supervised stereo matching that addresses occlusion challenges by transforming one-sided valid signals into bilateral valid feedback using pseudo-stereo inputs.

DetailsMotivation: Self-supervised stereo matching relies on photometric consistency but is fundamentally hindered by occlusion problems, where existing methods fail to provide complete solutions by focusing on erroneous feedback removal or additional regularities.

Method: Uses a pseudo-stereo inputs strategy that decouples input and feedback, transforming fixed one-sided valid signals into probabilistic acquisition of valid feedback from both sides of occluders without additional constraints.

Result: Qualitative results show the occlusion problem is resolved, with fully symmetrical and identical performance on both sides of occluding objects. Quantitative experiments validate significant performance improvements.

Conclusion: The proposed probabilistic framework provides a fundamental solution to occlusion challenges in self-supervised stereo matching, achieving symmetrical performance and substantial improvements over existing methods.

Abstract: Self-supervised stereo matching holds great promise by eliminating the reliance on expensive ground-truth data. Its dominant paradigm, based on photometric consistency, is however fundamentally hindered by the occlusion challenge – an issue that persists regardless of network architecture. The essential insight is that for any occluders, valid feedback signals can only be derived from the unoccluded areas on one side of the occluder. Existing methods attempt to address this by focusing on the erroneous feedback from the other side, either by identifying and removing it, or by introducing additional regularities for correction on that basis. Nevertheless, these approaches have failed to provide a complete solution. This work proposes a more fundamental solution. The core idea is to transform the fixed state of one-sided valid and one-sided erroneous signals into a probabilistic acquisition of valid feedback from both sides of an occluder. This is achieved through a complete framework, centered on a pseudo-stereo inputs strategy that decouples the input and feedback, without introducing any additional constraints. Qualitative results visually demonstrate that the occlusion problem is resolved, manifested by fully symmetrical and identical performance on both flanks of occluding objects. Quantitative experiments thoroughly validate the significant performance improvements resulting from solving the occlusion challenge.

[155] Residual Kolmogorov-Arnold Network for Enhanced Deep Learning

Ray Congrui Yu, Sherry Wu, Jiang Gui

Main category: cs.CV

TL;DR: RKAN is a compact plug-in module that enhances traditional CNNs by integrating polynomial feature transformations, improving performance while reducing computational costs and overfitting risks.

DetailsMotivation: Deep CNNs are computationally inefficient and prone to overfitting due to their linear nature and many layers, especially with small datasets.

Method: Developed Residual Kolmogorov-Arnold Network (RKAN) as a plug-in module that can be added to any stage of traditional CNNs to integrate supportive polynomial feature transformations.

Result: RKAN consistently improves baseline models across different vision tasks and benchmarks, achieving state-of-the-art performance.

Conclusion: RKAN provides an efficient solution to enhance CNN performance while addressing computational inefficiency and overfitting issues in deep networks.

Abstract: Despite their immense success, deep convolutional neural networks (CNNs) can be difficult to optimize and costly to train due to hundreds of layers within the network depth. Conventional convolutional operations are fundamentally limited by their linear nature along with fixed activations, where many layers are needed to learn meaningful patterns in data. Because of the sheer size of these networks, this approach is simply computationally inefficient, and poses overfitting or gradient explosion risks, especially in small datasets. As a result, we introduce a “plug-in” module, called Residual Kolmogorov-Arnold Network (RKAN). Our module is highly compact, so it can be easily added into any stage (level) of traditional deep networks, where it learns to integrate supportive polynomial feature transformations to existing convolutional frameworks. RKAN offers consistent improvements over baseline models in different vision tasks and widely tested benchmarks, accomplishing cutting-edge performance on them.
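The "supportive polynomial feature transformations" RKAN adds can be pictured as a KAN-style residual branch over a polynomial basis. A minimal sketch using Chebyshev polynomials (an assumption for illustration, not necessarily RKAN's exact basis or module design):

```python
import numpy as np

def chebyshev_features(x, degree):
    """Chebyshev polynomials T_0..T_degree of tanh-squashed activations,
    the kind of polynomial basis a KAN-style branch learns weights over."""
    z = np.tanh(x)                       # squash into [-1, 1], Chebyshev's domain
    T = [np.ones_like(z), z]             # T_0(z) = 1, T_1(z) = z
    for _ in range(2, degree + 1):
        T.append(2 * z * T[-1] - T[-2])  # recurrence T_k = 2z*T_{k-1} - T_{k-2}
    return np.stack(T[:degree + 1], axis=-1)

def rkan_block(x, weights):
    """Residual connection plus a learned polynomial feature transform:
    y = x + sum_k w_k * T_k(tanh(x)).  A hypothetical sketch, not the paper's
    exact convolutional module."""
    return x + chebyshev_features(x, len(weights) - 1) @ weights
```

The residual form means the branch can only add expressiveness on top of the existing feature map, which is what lets it plug into any stage of a pre-defined CNN.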

[156] Are Minimal Radial Distortion Solvers Necessary for Relative Pose Estimation?

Charalambos Tzamos, Viktor Kocur, Yaqing Ding, Torsten Sattler, Zuzana Kukelova

Main category: cs.CV

TL;DR: Simple approach combining efficient pinhole solver with sampled radial distortion parameters performs similarly to or better than complex minimal distortion solvers at faster run-times.

DetailsMotivation: Most cameras exhibit radial distortion, but modeling it with minimal solvers is complex and slow compared to pinhole solvers. Not modeling distortion leads to worse results.

Method: Combine efficient pinhole solver with sampled radial distortion parameters instead of using complex minimal radial distortion solvers.

Result: The simple approach performs similarly or better than most accurate minimal distortion solvers at faster run-times, and significantly more accurate than faster non-minimal solvers.

Conclusion: Complex radial distortion solvers are not necessary in practice; simple pinhole solver with sampled distortion parameters is sufficient and more efficient.

Abstract: Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion. Not modeling radial distortion leads to (significantly) worse results. However, minimal radial distortion solvers are significantly more complex than pinhole solvers, both in terms of run-time and implementation efforts. This paper compares radial distortion solvers with a simple-to-implement approach that combines an efficient pinhole solver with sampled radial distortion parameters. Extensive experiments on multiple datasets and RANSAC variants show that this simple approach performs similarly or better than the most accurate minimal distortion solvers at faster run-times while being significantly more accurate than faster non-minimal solvers. We clearly show that complex radial distortion solvers are not necessary in practice. Code and benchmark are available at https://github.com/kocurvik/rd.
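The core trick is cheap to sketch. Under the one-parameter division model (a common choice in this line of work, though the paper's exact parameterization may differ), undistorting the correspondences for a sampled candidate λ reduces the problem to the pinhole case, so a fast pinhole solver can be run per candidate inside RANSAC. A hypothetical NumPy illustration:

```python
import numpy as np

def undistort_division(pts, lam):
    """Map distorted image points to pinhole points under the one-parameter
    division model: x_u = x_d / (1 + lam * ||x_d||^2)."""
    r2 = np.sum(pts**2, axis=1, keepdims=True)
    return pts / (1.0 + lam * r2)

def distort_division(pts, lam):
    """Inverse mapping (pinhole -> distorted); points assumed away from the
    image center. Solves lam*r_u*r_d^2 - r_d + r_u = 0 for r_d, taking the
    branch that tends to r_u as lam -> 0."""
    if abs(lam) < 1e-12:
        return pts.copy()
    r_u = np.linalg.norm(pts, axis=1, keepdims=True)
    r_d = (1.0 - np.sqrt(1.0 - 4.0 * lam * r_u**2)) / (2.0 * lam * r_u)
    return pts * (r_d / r_u)

# Instead of a minimal distortion solver: sample candidate lambdas, undistort
# the correspondences, and run an efficient pinhole solver on each candidate.
lambda_samples = np.linspace(-0.3, 0.0, 7)   # typical barrel-distortion range
```

Each sampled λ would then be scored by the inlier count of the pinhole solver's RANSAC loop on the undistorted correspondences, keeping the best-scoring candidate.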

[157] Three-view Focal Length Recovery From Homographies

Yaqing Ding, Viktor Kocur, Zuzana Berger Haladová, Qianliang Wu, Shen Cai, Jian Yang, Zuzana Kukelova

Main category: cs.CV

TL;DR: Novel approach for recovering focal lengths from three-view homographies using normal vector consistency and elimination techniques, enabling efficient polynomial solving for various camera configurations.

DetailsMotivation: Existing methods rely on two-view solvers which may be less efficient and accurate; three-view homographies provide additional constraints for better focal length recovery.

Method: Examine consistency of normal vectors between homographies to derive explicit constraints, convert problems into solving polynomials using Sturm sequence or hidden variable technique.

Result: Proposed solvers are faster and more accurate than existing two-view methods, handling four different camera focal length configurations.

Conclusion: Three-view homographies provide valuable constraints for focal length recovery, with the proposed method offering improved performance over traditional approaches.

Abstract: In this paper, we propose a novel approach for recovering focal lengths from three-view homographies. By examining the consistency of normal vectors between two homographies, we derive new explicit constraints between the focal lengths and homographies using an elimination technique. We demonstrate that three-view homographies provide two additional constraints, enabling the recovery of one or two focal lengths. We discuss four possible cases, including three cameras having an unknown equal focal length, three cameras having two different unknown focal lengths, three cameras where one focal length is known, and the other two cameras have equal or different unknown focal lengths. All the problems can be converted into solving polynomials in one or two unknowns, which can be efficiently solved using Sturm sequence or hidden variable technique. Evaluation using both synthetic and real data shows that the proposed solvers are both faster and more accurate than methods relying on existing two-view solvers. The code and data are available on https://github.com/kocurvik/hf
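Sturm-sequence root counting, which the abstract mentions for solving the resulting univariate polynomials, is a standard technique: build the sequence of negated polynomial remainders, then subtract sign-change counts at the interval endpoints. A generic sketch (not the paper's code):

```python
import numpy as np

def _trim(p, tol=1e-10):
    """Drop numerically-zero leading coefficients."""
    p = np.asarray(p, float)
    nz = np.nonzero(np.abs(p) > tol)[0]
    return p[nz[0]:] if nz.size else np.array([])

def sturm_sequence(coeffs):
    """Sturm sequence (highest-degree-first coefficients): p0 = p, p1 = p',
    then negated polynomial remainders until a constant is reached."""
    seq = [np.asarray(coeffs, float), np.polyder(np.asarray(coeffs, float))]
    while len(seq[-1]) > 1:
        rem = _trim(np.polydiv(seq[-2], seq[-1])[1])
        if rem.size == 0:
            break
        seq.append(-rem)
    return seq

def count_real_roots(coeffs, a, b):
    """Number of distinct real roots of the polynomial in (a, b]."""
    seq = sturm_sequence(coeffs)
    def changes(x):
        vals = [np.polyval(p, x) for p in seq]
        signs = [v for v in vals if abs(v) > 1e-12]
        return sum(s1 * s2 < 0 for s1, s2 in zip(signs, signs[1:]))
    return changes(a) - changes(b)
```

Once an interval is known to contain exactly one root, simple bisection pins it down, which is why this kind of solver can be both fast and numerically robust for low-degree polynomials.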

[158] Optimized Minimal 3D Gaussian Splatting

Joo Chan Lee, Jong Hwan Ko, Eunbyung Park

Main category: cs.CV

TL;DR: OMG reduces 3D Gaussian Splatting storage by 50% and enables 600+ FPS rendering through minimal Gaussian primitives and compact attribute representation.

DetailsMotivation: Current 3DGS compression methods use too many Gaussians and focus mainly on attribute compression, leading to high computational costs and storage overhead. Reducing Gaussian count is essential for efficiency.

Method: Proposes Optimized Minimal Gaussians (OMG) with: 1) Distinct Gaussian selection to minimize redundancy, 2) Compact attribute representation for continuity and irregularity, 3) Sub-vector quantization for improved irregularity representation with negligible codebook.

Result: OMG reduces storage by nearly 50% compared to state-of-the-art methods while maintaining high rendering quality and achieving 600+ FPS rendering performance.

Conclusion: OMG provides an efficient 3DGS representation that significantly reduces storage requirements and computational costs while preserving rendering quality, enabling real-time high-performance applications.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for real-time, high-performance rendering, enabling a wide range of applications. However, representing 3D scenes with numerous explicit Gaussian primitives imposes significant storage and memory overhead. Recent studies have shown that high-quality rendering can be achieved with a substantially reduced number of Gaussians when represented with high-precision attributes. Nevertheless, existing 3DGS compression methods still rely on a relatively large number of Gaussians, focusing primarily on attribute compression. This is because a smaller set of Gaussians becomes increasingly sensitive to lossy attribute compression, leading to severe quality degradation. Since the number of Gaussians is directly tied to computational costs, it is essential to reduce the number of Gaussians effectively rather than only optimizing storage. In this paper, we propose Optimized Minimal Gaussians representation (OMG), which significantly reduces storage while using a minimal number of primitives. First, we determine the distinct Gaussian from the near ones, minimizing redundancy without sacrificing quality. Second, we propose a compact and precise attribute representation that efficiently captures both continuity and irregularity among primitives. Additionally, we propose a sub-vector quantization technique for improved irregularity representation, maintaining fast training with a negligible codebook size. Extensive experiments demonstrate that OMG reduces storage requirements by nearly 50% compared to the previous state-of-the-art and enables 600+ FPS rendering while maintaining high rendering quality. Our source code is available at https://maincold2.github.io/omg/.
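The sub-vector quantization idea resembles product quantization: split each attribute vector into short sub-vectors and quantize each against its own tiny codebook, so total codebook storage stays negligible while quantization error drops. A generic sketch (function names are illustrative, not OMG's API):

```python
import numpy as np

def fit_subvector_codebooks(X, n_sub, k, iters=10, seed=0):
    """Split D-dim rows of X into n_sub equal sub-vectors and learn a small
    k-entry codebook for each via a few k-means iterations."""
    rng = np.random.default_rng(seed)
    books = []
    for S in np.split(X, n_sub, axis=1):
        C = S[rng.choice(len(S), k, replace=False)]       # init from data
        for _ in range(iters):
            a = np.linalg.norm(S[:, None] - C[None], axis=2).argmin(1)
            for j in range(k):
                if np.any(a == j):
                    C[j] = S[a == j].mean(0)              # centroid update
        books.append(C)
    return books

def encode(X, books):
    """Per sub-vector, store only the index of the nearest codeword."""
    codes = [np.linalg.norm(S[:, None] - C[None], axis=2).argmin(1)
             for S, C in zip(np.split(X, len(books), axis=1), books)]
    return np.stack(codes, axis=1)        # (N, n_sub) small integer codes

def decode(codes, books):
    """Reconstruct approximate vectors by concatenating looked-up codewords."""
    return np.hstack([C[codes[:, i]] for i, C in enumerate(books)])
```

Storage per Gaussian drops from D floats to n_sub small integers; the codebooks themselves are shared across all primitives, which is what keeps their overhead negligible.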

[159] What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

Dongheng Lin, Han Hu, Jianbo Jiao

Main category: cs.CV

TL;DR: The paper proposes learning time awareness from static images through a Time-Image Contrastive Learning (TICL) approach using a new Time-Oriented Collection dataset, achieving state-of-the-art timestamp estimation and demonstrating strong performance on time-aware downstream tasks.

DetailsMotivation: Time becomes visible through illumination changes in visual scenes, inspiring the exploration of whether time awareness can be learned from static images to understand what time tells us about visual content.

Method: Introduced a Time-Oriented Collection (TOC) dataset with 130,906 timestamped images and proposed Time-Image Contrastive Learning (TICL) to jointly model timestamps and visual representations through cross-modal contrastive learning.

Result: TICL achieves state-of-the-art performance on timestamp estimation and the learned time-aware embeddings show strong capability in downstream tasks including time-based image retrieval, video scene classification, and time-aware image editing.

Conclusion: Time-related visual cues can be effectively learned from static images and benefit various vision tasks, laying a foundation for future research on understanding time-related visual context.

Abstract: Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: what time tells us? To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning. We found that the proposed TICL, 1) not only achieves state-of-the-art performance on the timestamp estimation task, over various benchmark metrics, 2) but also, interestingly, though only seeing static images, the time-aware embeddings learned from TICL show strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context. Project page: https://rathgrith.github.io/timetells_release/
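Cross-modal contrastive training of the kind TICL describes typically uses a CLIP-style symmetric InfoNCE loss, pulling matched (image, timestamp) embedding pairs together along the diagonal of a similarity matrix. A minimal NumPy sketch, assuming precomputed embeddings (an illustration of the general recipe, not the paper's exact loss):

```python
import numpy as np

def symmetric_infonce(img_emb, time_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss: row i of img_emb and row i of
    time_emb are a matched pair; all other rows serve as negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    tim = time_emb / np.linalg.norm(time_emb, axis=1, keepdims=True)
    logits = img @ tim.T / temperature          # (N, N) scaled cosine sims
    labels = np.arange(len(logits))

    def ce(l):                                  # cross-entropy over diagonal
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))    # image->time and time->image
```

Minimizing this loss shapes an embedding space where nearest-neighbor lookups double as timestamp estimation and time-based image retrieval, which is why the same embeddings transfer to the downstream tasks listed above.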

[160] CFReID: Continual Few-shot Person Re-Identification

Hao Ni, Lianli Gao, Pengpeng Zeng, Heng Tao Shen, Jingkuan Song

Main category: cs.CV

TL;DR: Proposes Continual Few-shot ReID (CFReID) - a new paradigm for person re-identification that incrementally learns from few-shot data across multiple domains while avoiding catastrophic forgetting, using only 5% of data compared to existing methods.

DetailsMotivation: Real-world surveillance systems evolve dynamically, requiring ReID models to handle new domains continuously. Current Lifelong ReID methods need large labeled datasets which are impractical due to privacy and cost concerns.

Method: Proposes Stable Distribution Alignment (SDA) framework with two modules: Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA) to address few-shot learning and catastrophic forgetting from feature distribution perspective.

Result: Extensive experiments show SDA enhances few-shot learning and anti-forgetting capabilities. With only 5% of the data (32 IDs), it significantly outperforms state-of-the-art LReID methods that require 700-1000 IDs.

Conclusion: The proposed CFReID paradigm and SDA framework effectively address the challenges of continual learning with few-shot data, making ReID more practical for real-world surveillance applications with limited labeled data.

Abstract: Real-world surveillance systems are dynamically evolving, requiring a person Re-identification model to continuously handle newly incoming data from various domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed to learn and accumulate knowledge across multiple domains incrementally. However, LReID models need to be trained on large-scale labeled data for each unseen domain, which are typically inaccessible due to privacy and cost concerns. In this paper, we propose a new paradigm called Continual Few-shot ReID (CFReID), which requires models to be incrementally trained using few-shot data and tested on all seen domains. Under few-shot conditions, CFReID faces two core challenges: 1) learning knowledge from few-shot data of an unseen domain, and 2) avoiding catastrophic forgetting of seen domains. To tackle these two challenges, we propose a Stable Distribution Alignment (SDA) framework from a feature distribution perspective. Specifically, our SDA is composed of two modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). To support the study of CFReID, we establish an evaluation benchmark for CFReID on five publicly available ReID datasets. Extensive experiments demonstrate that our SDA can enhance the few-shot learning and anti-forgetting capabilities under few-shot conditions. Notably, our approach, using only 5% of the data, i.e., 32 IDs, significantly outperforms LReID’s state-of-the-art performance, which requires 700 to 1,000 IDs.
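The Prototype-based Few-shot Adaptation module is described only at a high level; the underlying prototype idea (ProtoNet-style class means with nearest-prototype assignment, an assumption about the general mechanism rather than the paper's exact module) can be sketched generically:

```python
import numpy as np

def prototypes(support_feats, support_labels):
    """One prototype per identity: the mean feature of its few support
    samples, ProtoNet-style."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(0)
                       for c in classes])
    return classes, protos

def classify(query_feats, classes, protos):
    """Assign each query feature to the identity of its nearest prototype
    (Euclidean distance in feature space)."""
    d = np.linalg.norm(query_feats[:, None] - protos[None], axis=2)
    return classes[d.argmin(1)]
```

Because only per-class means are stored and updated, this kind of adaptation is cheap under few-shot conditions and avoids retraining the backbone on each new domain.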

[161] CREA: A Collaborative Multi-Agent Framework for Creative Image Editing and Generation

Kavana Venkatesh, Connor Dunlop, Pinar Yanardag

Main category: cs.CV

TL;DR: CREA is a multi-agent collaborative framework for creative AI image editing that outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation.

DetailsMotivation: Creativity in AI imagery requires generating visually compelling content with novel, expressive transformations, going beyond conventional prompt-based editing to achieve autonomous, iterative creative processes.

Method: A multi-agent collaborative framework with specialized AI agents that dynamically collaborate to conceptualize, generate, critique, and enhance images, mimicking the human creative process.

Result: Extensive evaluations show CREA significantly outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation.

Conclusion: This is the first work to introduce the task of creative editing, demonstrating that multi-agent collaboration can effectively address creative AI image generation challenges.

Abstract: Creativity in AI imagery remains a fundamental challenge, requiring not only the generation of visually compelling content but also the capacity to add novel, expressive, and artistically rich transformations to images. Unlike conventional editing tasks that rely on direct prompt-based modifications, creative image editing requires an autonomous, iterative approach that balances originality, coherence, and artistic intent. To address this, we introduce CREA, a novel multi-agent collaborative framework that mimics the human creative process. Our framework leverages a team of specialized AI agents who dynamically collaborate to conceptualize, generate, critique, and enhance images. Through extensive qualitative and quantitative evaluations, we demonstrate that CREA significantly outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation. To the best of our knowledge, this is the first work to introduce the task of creative editing.

[162] RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability

Jonggwon Park, Byungmu Yoon, Soobum Kim, Kyoyun Choi

Main category: cs.CV

TL;DR: RadZero is a novel vision-language alignment framework for chest X-rays with zero-shot multi-task capability, using VL-CABS for interpretable fine-grained reasoning and outperforming SOTA methods.

DetailsMotivation: Existing multimodal models struggle to effectively utilize complex radiology reports and offer limited interpretability through attention visualizations.

Method: Uses VL-CABS to align text embeddings with local image features, employs LLMs to extract semantic sentences from reports, and uses multi-positive contrastive training with pre-trained vision encoder and additional Transformer layers.

Result: Outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation on public chest radiograph benchmarks.

Conclusion: RadZero demonstrates improved explainability in VL alignment and shows capability for open-vocabulary semantic segmentation in medical imaging.

Abstract: Recent advancements in multimodal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in chest X-ray with zero-shot multi-task capability. A key component of our approach is VL-CABS (Vision-Language Cross-Attention Based on Similarity), which aligns text embeddings with local image features for interpretable, fine-grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi-positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, VL-CABS enables zero-shot inference with similarity probability for classification, and pixel-level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL-CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero’s capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging. Code is available at https://github.com/deepnoid-ai/RadZero.
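At inference time, the similarity computation the abstract describes reduces to cosine similarity between one text embedding and every local patch feature: the per-patch map serves grounding and segmentation, and pooling it gives a classification score. A simplified sketch (not the paper's exact scoring):

```python
import numpy as np

def vl_similarity_map(patch_feats, text_emb):
    """Cosine similarity between a text embedding and each local patch
    feature: (H, W, D) -> (H, W) map, plus a pooled image-level score."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = p @ t               # (H, W) pixel-level similarity map
    return sim, sim.max()     # map for grounding/segmentation, score for classification
```

Upsampling the (H, W) map to image resolution and thresholding it is one simple route from this kind of similarity map to a zero-shot segmentation mask.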

[163] EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting

Wei Zhang, Miaoxin Cai, Yaqian Ning, Tong Zhang, Yin Zhuang, Shijian Lu, He Chen, Jun Li, Xuerui Mao

Main category: cs.CV

TL;DR: EarthGPT-X is a flexible spatial multi-modal large language model for remote sensing that unifies multi-source imagery comprehension and performs both coarse-grained and fine-grained visual tasks using diverse visual prompts in a single framework.

DetailsMotivation: Existing remote sensing MLLMs are limited to optical imagery and plain language interaction, preventing flexible and scalable real-world applications. The transfer of natural-domain MLLMs to remote sensing is hindered by heterogeneous sensing physics, diverse modalities, and unique spatial scales.

Method: Introduces a dual-prompt mechanism combining text instructions with various visual prompts (point, box, free-form), a comprehensive multi-source multi-level prompting dataset for hierarchical spatial reasoning, and a cross-domain one-stage fusion training strategy for efficient modality alignment.

Result: Extensive experiments demonstrate that EarthGPT-X substantially outperforms prior natural-domain and remote sensing MLLMs, establishing the first framework capable of multi-source, multi-task, and multi-level interpretation using visual prompting in remote sensing scenarios.

Conclusion: EarthGPT-X represents a significant advancement in remote sensing MLLMs, providing the first flexible framework that can handle multi-source imagery, multiple tasks, and hierarchical spatial reasoning through visual prompting, overcoming limitations of previous approaches.

Abstract: Recent advances in natural-domain multi-modal large language models (MLLMs) have demonstrated effective spatial reasoning through visual and textual prompting. However, their direct transfer to remote sensing (RS) is hindered by heterogeneous sensing physics, diverse modalities, and unique spatial scales. Existing RS MLLMs are mainly limited to optical imagery and plain language interaction, preventing flexible and scalable real-world applications. In this article, EarthGPT-X is proposed, the first flexible spatial MLLM that unifies multi-source RS imagery comprehension and accomplishes both coarse-grained and fine-grained visual tasks under diverse visual prompts in a single framework. Distinct from prior models, EarthGPT-X introduces: 1) a dual-prompt mechanism combining text instructions with various visual prompts (i.e., point, box, and free-form) to mimic the versatility of referring in human life; 2) a comprehensive multi-source multi-level prompting dataset, with which the model advances beyond holistic image understanding to support hierarchical spatial reasoning, including scene-level understanding and fine-grained object attributes and relational analysis; 3) a cross-domain one-stage fusion training strategy, enabling efficient and consistent alignment across modalities and tasks. Extensive experiments demonstrate that EarthGPT-X substantially outperforms prior natural-domain and RS MLLMs, establishing the first framework capable of multi-source, multi-task, and multi-level interpretation using visual prompting in RS scenarios.

[164] Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, Daniel Cremers

Main category: cs.CV

TL;DR: BA-Track is a SLAM framework that handles dynamic scenes by separating camera-induced motion from object motion using a 3D point tracker, enabling reliable bundle adjustment and producing consistent dense reconstructions.

DetailsMotivation: Traditional SLAM systems fail in dynamic scenes common in casual videos, where moving objects violate the static environment assumption. Existing methods either remove dynamic elements (incomplete reconstructions) or model motion independently (inconsistent estimates).

Method: Uses a 3D point tracker to decompose observed motion into camera-induced and object motion components. Combines bundle adjustment with learning-based tracking and lightweight depth refinement using scale maps for temporal consistency.
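
The decomposition at the heart of BA-Track can be illustrated with a toy pinhole-camera sketch (an illustrative assumption, not the paper's learned tracker): the camera-induced motion of a tracked point is where it would reproject if the scene were static, and the residual of the observed track is attributed to object motion.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of a 3D point X under camera pose (R, t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def decompose_motion(K, pose1, pose2, X, observed_p2):
    """Split an observed 2D track into camera-induced and object motion.

    Camera-induced motion is the displacement the point would undergo
    if it were static; the residual is attributed to the object.
    """
    R1, t1 = pose1
    R2, t2 = pose2
    p1 = project(K, R1, t1, X)
    p2_static = project(K, R2, t2, X)      # static-scene prediction
    cam_motion = p2_static - p1            # camera-induced component
    obj_motion = (observed_p2 - p1) - cam_motion
    return cam_motion, obj_motion
```

Feeding only the camera-induced component to bundle adjustment is what lets the optimizer treat all scene points, including those on moving objects, as if they were static.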

Result: Significant improvements in camera pose estimation and 3D reconstruction accuracy on challenging datasets. Produces temporally coherent and scale-consistent dense reconstructions that accommodate both static and dynamic elements.

Conclusion: The unified BA-Track framework successfully integrates motion decomposition, bundle adjustment, and depth refinement to handle dynamic scenes effectively while maintaining the reliability of traditional SLAM core components.

Abstract: Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM – bundle adjustment – with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.

[165] WaveGuard: Robust Deepfake Detection and Source Tracing via Dual-Tree Complex Wavelet and Graph Neural Networks

Ziyuan He, Zhiqing Guo, Liejun Wang, Gaobo Yang, Yunfeng Diao, Dan Ma

Main category: cs.CV

TL;DR: WaveGuard is a proactive watermarking framework that embeds watermarks in high-frequency sub-bands using DT-CWT and SC-GNN to enhance robustness and imperceptibility against deepfake threats.

DetailsMotivation: To address the increasing risks of deepfake technology such as privacy invasion and identity theft by developing a robust watermarking solution.

Method: Uses Dual-Tree Complex Wavelet Transform (DT-CWT) for frequency-domain embedding, Structural Consistency Graph Neural Network (SC-GNN) for visual quality preservation, and an attention module for embedding precision refinement.

Result: Outperforms state-of-the-art methods in both robustness and visual quality on face swap and reenactment tasks.

Conclusion: WaveGuard provides an effective solution for deepfake detection and prevention through advanced watermarking techniques with superior performance.

Abstract: Deepfake technology poses increasing risks such as privacy invasion and identity theft. To address these threats, we propose WaveGuard, a proactive watermarking framework that enhances robustness and imperceptibility via frequency-domain embedding and graph-based structural consistency. Specifically, we embed watermarks into high-frequency sub-bands using Dual-Tree Complex Wavelet Transform (DT-CWT) and employ a Structural Consistency Graph Neural Network (SC-GNN) to preserve visual quality. We also design an attention module to refine embedding precision. Experimental results on face swap and reenactment tasks demonstrate that WaveGuard outperforms state-of-the-art methods in both robustness and visual quality. Code is available at https://github.com/vpsg-research/WaveGuard.

[166] Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks

Giyeong Oh, Woohyun Cho, Siyeol Kim, Suhwan Choi, Youngjae Yu

Main category: cs.CV

TL;DR: Orthogonal Residual Update decomposes module outputs into components parallel and orthogonal to the input stream, adding only the orthogonal component to encourage learning of novel features rather than reinforcing existing directions.

DetailsMotivation: Standard residual connections directly add module outputs to inputs, which may predominantly reinforce existing feature directions and underutilize the module's capacity for learning entirely new features.

Method: Decompose the module’s output relative to the input stream and add only the component orthogonal to this stream, guiding modules to contribute primarily new representational directions.
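
The update rule is simple to state: project the module output f(x) onto the input stream x, discard the parallel part, and add only the remainder. A minimal numpy sketch in vector form (the paper applies this inside deep networks, e.g. per token; the `eps` stabilizer is an implementation assumption):

```python
import numpy as np

def orthogonal_residual_update(x, fx, eps=1e-8):
    """Residual update keeping only the component of f(x) orthogonal to x.

    A standard residual connection would return x + fx; here the part of
    fx parallel to x is removed, so the module contributes only new
    representational directions.
    """
    x = np.asarray(x, dtype=float)
    fx = np.asarray(fx, dtype=float)
    parallel = (np.dot(fx, x) / (np.dot(x, x) + eps)) * x
    return x + (fx - parallel)
```

By construction the added update is (numerically almost) orthogonal to the stream, so it cannot merely rescale the existing direction.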

Result: Improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving +3.78 pp top-1 accuracy gain for ViT-B on ImageNet-1k.

Conclusion: Orthogonal residual updates foster richer feature learning and more efficient training by encouraging modules to contribute novel representational directions rather than reinforcing existing feature streams.

Abstract: Residual connections are pivotal for deep neural networks, enabling greater depth by mitigating vanishing gradients. However, in standard residual updates, the module’s output is directly added to the input stream. This can lead to updates that predominantly reinforce or modulate the existing stream direction, potentially underutilizing the module’s capacity for learning entirely novel features. In this work, we introduce Orthogonal Residual Update: we decompose the module’s output relative to the input stream and add only the component orthogonal to this stream. This design aims to guide modules to contribute primarily new representational directions, fostering richer feature learning while promoting more efficient training. We demonstrate that our orthogonal update strategy improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving, for instance, a +3.78 pp top-1 accuracy gain for ViT-B on ImageNet-1k.

[167] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang

Main category: cs.CV

TL;DR: DOVE is an efficient one-step diffusion model for real-world video super-resolution that achieves comparable performance to multi-step methods while being 28x faster.

DetailsMotivation: Diffusion models show promise for video super-resolution but are slow due to requiring dozens of sampling steps. Single-step sampling could solve this but is challenging due to high training costs and fidelity requirements.

Method: Fine-tune pretrained CogVideoX using latent-pixel training strategy with two-stage adaptation to VSR task. Construct HQ-VSR dataset with video processing pipeline for enhanced training.

Result: DOVE achieves comparable or superior performance to multi-step diffusion-based VSR methods while offering 28x speed-up over methods like MGLD-VSR.

Conclusion: DOVE successfully demonstrates efficient one-step diffusion for VSR with significant speed improvements while maintaining high-quality restoration.

Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step sampling, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28× speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.

[168] Causal2Needles: A Long-Context Video Understanding Benchmark

Miaoyu Li, Qin Chao, Boyang Li

Main category: cs.CV

TL;DR: Causal2Needles is a new benchmark for evaluating Video-Language Models’ ability to extract information from two separate locations in long videos and understand causal relationships in human behaviors.

DetailsMotivation: Existing benchmarks don't adequately assess VLMs' ability to extract information from multiple locations in long videos and understand cause-effect relationships in human behaviors.

Method: The benchmark uses three question types: noncausal one-needle, causal one-needle, and causal two-needle questions. It introduces two question formats to prevent textual bias: locating video clips containing answers and verbal description of visual details.

Result: Models that perform well on existing benchmarks struggle with causal two-needle questions, and performance decreases as the distance between the two information locations increases.

Conclusion: Current VLMs have critical limitations in handling long-context video understanding, particularly for extracting and connecting information from multiple locations and understanding causal relationships.

Abstract: Properly evaluating the ability of Video-Language Models (VLMs) to understand long videos remains a challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently addressed by existing benchmarks: (1) extracting information from two separate locations (two needles) in a long video and understanding them jointly, and (2) modeling the world in terms of cause and effect in human behaviors. Causal2Needles evaluates these abilities using noncausal one-needle, causal one-needle, and causal two-needle questions. The most complex question type, causal two-needle questions, requires extracting information about both the cause and effect events from a long video and the associated narration text. To prevent textual bias, we introduce two complementary question formats: locating the video clip containing the answer, and verbally describing a visual detail from that video clip. Our experiments reveal that models excelling on existing benchmarks struggle with causal two-needle questions, and that model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs. The dataset is available at: https://huggingface.co/datasets/causal2needles/Causal2Needles

[169] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yolo Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu

Main category: cs.CV

TL;DR: MMPerspective is the first benchmark to systematically evaluate multimodal large language models’ understanding of perspective geometry through 10 tasks across perception, reasoning, and robustness dimensions.

DetailsMotivation: To understand the extent to which multimodal large language models internalize perspective geometry, which is fundamental to human visual perception but unclear in current MLLMs.

Method: Created MMPerspective benchmark with 2,711 real-world and synthetic images and 5,083 question-answer pairs across 10 tasks in three dimensions: Perspective Perception, Reasoning, and Robustness.

Result: Evaluation of 43 state-of-the-art MLLMs revealed significant limitations - models perform well on surface-level perceptual tasks but struggle with compositional reasoning and spatial consistency under perturbations.

Conclusion: MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems, revealing architecture-scale relationships and the benefits of chain-of-thought prompting.

Abstract: Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs’ understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/

[170] TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem

Main category: cs.CV

TL;DR: TextRegion combines image-text models with SAM2 to create text-aligned region tokens for detailed visual understanding without training, achieving strong performance on segmentation and grounding tasks.

DetailsMotivation: Image-text models lack detailed spatial understanding while segmentation models lack text alignment, creating a need to combine both capabilities for comprehensive visual understanding.

Method: Training-free framework that integrates image-text models with SAM2 segmentation to generate text-aligned region tokens that preserve open-vocabulary capabilities.

Result: Consistently achieves superior or competitive performance compared to state-of-the-art training-free methods on open-world semantic segmentation, referring expression comprehension, and grounding tasks.

Conclusion: TextRegion provides an effective, practical, and extensible solution for detailed visual understanding by leveraging complementary strengths of image-text and segmentation models.

Abstract: Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

[171] UMA: Ultra-detailed Human Avatars via Multi-level Surface Alignment

Heming Zhu, Guoxing Sun, Christian Theobalt, Marc Habermann

Main category: cs.CV

TL;DR: Proposes a method to create high-fidelity animatable human avatars from multi-view videos using 2D point trackers to guide 3D deformation, addressing geometric misalignment issues that limit detail preservation in existing methods.

DetailsMotivation: Current animatable avatar methods using implicit representations attached to drivable human templates fail to preserve fine details at high resolutions due to inaccurate surface tracking, depth misalignment, and surface drift between geometry and ground truth.

Method: Uses a latent deformation model supervised by foundational 2D video point trackers for improved robustness, with a cascaded training strategy that generates consistent 3D point tracks anchored to the rendered avatar to supervise at vertex and texel level.

Result: Significantly improved rendering quality and geometric accuracy over prior state-of-the-art methods, validated on a novel dataset with challenging clothing textures and wrinkle deformations captured using 40 calibrated 6K-resolution cameras.

Conclusion: The approach successfully addresses geometric misalignment issues in animatable avatar creation by leveraging 2D point trackers and cascaded training, enabling preservation of fine details at high resolutions.

Abstract: Learning an animatable and clothed human avatar model with vivid dynamics and photorealistic appearance from multi-view videos is an important foundational research problem in computer graphics and vision. Fueled by recent advances in implicit representations, the quality of the animatable avatars has achieved an unprecedented level by attaching the implicit representation to drivable human template meshes. However, they usually fail to preserve the highest level of detail, particularly apparent when the virtual camera is zoomed in and when rendering at 4K resolution and higher. We argue that this limitation stems from inaccurate surface tracking, specifically, depth misalignment and surface drift between character geometry and the ground truth surface, which forces the detailed appearance model to compensate for geometric errors. To address this, we propose a latent deformation model and supervise the 3D deformation of the animatable character using guidance from foundational 2D video point trackers, which offer improved robustness to shading and surface variations, and are less prone to local minima than differentiable rendering. To mitigate the drift over time and lack of 3D awareness of 2D point trackers, we introduce a cascaded training strategy that generates consistent 3D point tracks by anchoring point tracks to the rendered avatar, which ultimately supervises our avatar at the vertex and texel level. To validate the effectiveness of our approach, we introduce a novel dataset comprising five multi-view video sequences, each over 10 minutes in duration, captured using 40 calibrated 6K-resolution cameras, featuring subjects dressed in clothing with challenging texture patterns and wrinkle deformations. Our approach demonstrates significantly improved performance in rendering quality and geometric accuracy over the prior state of the art.

[172] MIND: Material Interface Generation from UDFs for Non-Manifold Surface Reconstruction

Xuhui Chen, Fei Hou, Wencheng Wang, Hong Qin, Ying He

Main category: cs.CV

TL;DR: MIND enables direct non-manifold mesh extraction from unsigned distance fields (UDFs) by creating material interfaces through spatial partitioning, overcoming limitations of local SDF conversion methods.

DetailsMotivation: Existing methods for extracting meshes from UDFs often convert them to signed distance fields locally, which introduces topological artifacts and fails to represent non-manifold geometry.

Method: Compute a two-signed local field for manifold patches, extend to multi-labeled global field for non-manifold structures, and combine with UDF to construct material interfaces for multi-labeled Marching Cubes extraction.

Result: Robustly handles complex non-manifold surfaces and significantly outperforms existing methods across diverse data sources including point cloud reconstruction, multi-view reconstruction, and medial axis transforms.

Conclusion: MIND provides an effective solution for non-manifold mesh extraction directly from UDFs, addressing fundamental limitations of prior approaches while maintaining global perspective.

Abstract: Unsigned distance fields (UDFs) are widely used in 3D deep learning due to their ability to represent shapes with arbitrary topology. While prior work has largely focused on learning UDFs from point clouds or multi-view images, extracting meshes from UDFs remains challenging, as the learned fields rarely attain exact zero distances. A common workaround is to reconstruct signed distance fields (SDFs) locally from UDFs to enable surface extraction via Marching Cubes. However, this often introduces topological artifacts such as holes or spurious components. Moreover, local SDFs are inherently incapable of representing non-manifold geometry, leading to complete failure in such cases. To address this gap, we propose MIND (Material Interface from Non-manifold Distance fields), a novel algorithm for generating material interfaces directly from UDFs, enabling non-manifold mesh extraction from a global perspective. The core of our method lies in deriving a meaningful spatial partitioning from the UDF, where the target surface emerges as the interface between distinct regions. We begin by computing a two-signed local field to distinguish the two sides of manifold patches, and then extend this to a multi-labeled global field capable of separating all sides of a non-manifold structure. By combining this multi-labeled field with the input UDF, we construct material interfaces that support non-manifold mesh extraction via a multi-labeled Marching Cubes algorithm. Extensive experiments on UDFs generated from diverse data sources, including point cloud reconstruction, multi-view reconstruction, and medial axis transforms, demonstrate that our approach robustly handles complex non-manifold surfaces and significantly outperforms existing methods. The source code is available at https://github.com/jjjkkyz/MIND.

[173] HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model

Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang

Main category: cs.CV

TL;DR: HoliSafe introduces a comprehensive safety dataset and benchmark for VLMs covering all five safe/unsafe image-text combinations, plus a modular visual guard module (VGM) that enhances safety through interpretable harmfulness classification.

DetailsMotivation: Current VLM safety approaches have two main shortcomings: limited coverage of harmful image-text interactions that leaves models vulnerable to jailbreak attacks, and over-reliance on data-centric tuning without architectural innovations for intrinsic safety.

Method: Proposes HoliSafe dataset/benchmark covering all five safe/unsafe image-text combinations, and a modular visual guard module (VGM) that assesses image harmfulness and provides interpretable refusal justifications. The VGM is designed as a plug-in component for diverse VLMs.

Result: Safe-VLM with VGM trained on HoliSafe achieves state-of-the-art safety performance across multiple benchmarks. HoliSafe-Bench reveals critical vulnerabilities in existing VLM models.

Conclusion: HoliSafe and VGM provide a foundation for robust and interpretable VLM safety, enabling modular integration across diverse models and expanding avenues for multimodal alignment research.

Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLM models. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

[174] AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

Main category: cs.CV

TL;DR: AutoVLA is a Vision-Language-Action model that unifies reasoning and action generation for autonomous driving using autoregressive generation, tokenized trajectories, and dual thinking modes with reinforcement fine-tuning.

DetailsMotivation: Current VLA models struggle with physically infeasible actions, complex structures, and unnecessarily long reasoning for autonomous driving applications.

Method: Tokenizes continuous trajectories into discrete actions, uses supervised fine-tuning with dual thinking modes (fast trajectory-only and slow chain-of-thought), and applies reinforcement fine-tuning with GRPO to reduce unnecessary reasoning.
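
Tokenizing a continuous trajectory for a language model is typically done by binning; a hedged sketch of uniform binning follows (the paper's actual action vocabulary is not specified here, and `vmin`, `vmax`, and `n_bins` are illustrative assumptions):

```python
def tokenize_trajectory(waypoints, vmin=-50.0, vmax=50.0, n_bins=256):
    """Map each (x, y) waypoint coordinate to a discrete token by
    clamping to [vmin, vmax] and uniform binning into n_bins levels."""
    tokens = []
    for x, y in waypoints:
        for v in (x, y):
            v = min(max(v, vmin), vmax)   # clamp to the feasible range
            tokens.append(round((v - vmin) / (vmax - vmin) * (n_bins - 1)))
    return tokens

def detokenize_trajectory(tokens, vmin=-50.0, vmax=50.0, n_bins=256):
    """Map tokens back to coordinates at their bin values."""
    vals = [vmin + t / (n_bins - 1) * (vmax - vmin) for t in tokens]
    return list(zip(vals[0::2], vals[1::2]))
```

Because tokens index a fixed, bounded grid, every decoded waypoint lies in the clamped range, which is one way a discrete vocabulary can rule out physically infeasible outputs.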

Result: Demonstrates competitive performance across nuPlan, nuScenes, Waymo, and CARLA datasets in both open-loop and closed-loop settings with adaptive reasoning and accurate planning.

Conclusion: AutoVLA effectively integrates reasoning and action generation for end-to-end autonomous driving while maintaining efficiency through adaptive reasoning strategies.

Abstract: Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.

[175] Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Cong Wang, Zexuan Deng, Zhiwei Jiang, Yafeng Yin, Fei Shen, Zifeng Cheng, Shiping Ge, Shiwei Gan, Qing Gu

Main category: cs.CV

TL;DR: SignViP is a novel Sign Language Video Generation framework that uses multiple fine-grained conditions (poses and 3D hands) via discrete tokenization to generate more natural and expressive sign language videos from spoken language texts.

DetailsMotivation: Existing methods rely on single coarse conditions like skeleton sequences, which limit the naturalness and expressiveness of generated sign language videos.

Method: Proposes a three-component framework: 1) Sign Video Diffusion Model with multi-condition encoder, 2) FSQ Autoencoder for discrete tokenization, 3) Multi-Condition Token Translator to convert text to tokens.
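
The FSQ step in component 2 can be sketched in a few lines: bound each embedding dimension, then round it to one of a small number of fixed levels (a generic FSQ illustration using tanh bounding; SignViP's exact levels and bounding function are assumptions here):

```python
import math

def fsq_quantize(z, levels=8):
    """Finite Scalar Quantization sketch: bound each dimension to (-1, 1)
    with tanh, then snap it to one of `levels` uniformly spaced values."""
    out = []
    for v in z:
        b = math.tanh(v)                          # bounded value in (-1, 1)
        idx = round((b + 1) / 2 * (levels - 1))   # nearest level index
        out.append(idx / (levels - 1) * 2 - 1)    # quantized value in [-1, 1]
    return out
```

With d dimensions and L levels each, the codebook is the implicit product grid of L^d points, so no learned codebook (and no codebook collapse) is involved.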

Result: Achieves state-of-the-art performance across video quality, temporal coherence, and semantic fidelity metrics.

Conclusion: SignViP demonstrates that incorporating multiple fine-grained conditions through discrete tokenization significantly improves sign language video generation quality and expressiveness.

Abstract: Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on the single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.

[176] Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation

Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, Luisa Verdoliva

Main category: cs.CV

TL;DR: A novel video forensic method that improves generalization by focusing on intrinsic generative artifacts rather than model-specific flaws, using wavelet-based data augmentation to train detectors that work across multiple AI video generators.

DetailsMotivation: Current AI-generated video detectors have poor generalization across different generative models, limiting real-world applicability. The key insight is to focus on intrinsic low-level artifacts rather than high-level semantic flaws specific to individual models.

Method: Study generative architectures to identify robust discriminative features, then introduce forensic-oriented data augmentation using wavelet decomposition to replace specific frequency bands, forcing the model to learn more relevant forensic cues.

Result: The method achieves significant accuracy improvement over state-of-the-art detectors, with excellent performance even on recent models like NOVA and FLUX, despite training on data from only a single generative model.

Conclusion: The proposed training paradigm improves detector generalizability without needing complex algorithms or large multi-generator datasets, making AI-generated video detection more practical for real-world scenarios.

Abstract: Synthetic video generation is progressing very rapidly. The latest models can produce very realistic high-resolution videos that are virtually indistinguishable from real ones. Although several video forensic detectors have been recently proposed, they often exhibit poor generalization, which limits their applicability in a real-world scenario. Our key insight to overcome this issue is to guide the detector towards seeing what really matters. In fact, a well-designed forensic classifier should focus on identifying intrinsic low-level artifacts introduced by a generative architecture rather than relying on high-level semantic flaws that characterize a specific model. In this work, first, we study different generative architectures, searching and identifying discriminative features that are unbiased, robust to impairments, and shared across models. Then, we introduce a novel forensic-oriented data augmentation strategy based on the wavelet decomposition and replace specific frequency-related bands to drive the model to exploit more relevant forensic cues. Our novel training paradigm improves the generalizability of AI-generated video detectors, without the need for complex algorithms and large datasets that include multiple synthetic generators. To evaluate our approach, we train the detector using data from a single generative model and test it against videos produced by a wide range of other models. Despite its simplicity, our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models, such as NOVA and FLUX.
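The band-replacement augmentation can be sketched with a one-level Haar decomposition: decompose a training image, swap one frequency subband with that of another image, and reconstruct, so the detector cannot lean on cues in that band. The wavelet choice and which band is replaced are assumptions here; the paper only states that specific frequency-related bands are replaced.

```python
import numpy as np

def haar_dwt2(img):
    # one-level 2D Haar decomposition into LL, LH, HL, HH subbands
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row low-pass
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row high-pass
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    # exact inverse of haar_dwt2
    h, w = LL.shape
    a = np.empty((h, 2 * w)); d = np.empty((h, 2 * w))
    a[:, 0::2] = LL + LH; a[:, 1::2] = LL - LH
    d[:, 0::2] = HL + HH; d[:, 1::2] = HL - HH
    img = np.empty((2 * h, 2 * w))
    img[0::2, :] = a + d
    img[1::2, :] = a - d
    return img

def swap_band_augment(real, donor, band="HH"):
    """Replace one subband of `real` with the same subband of `donor`,
    keeping the remaining bands intact (hypothetical band choice)."""
    names = ["LL", "LH", "HL", "HH"]
    subs = dict(zip(names, haar_dwt2(real)))
    subs[band] = dict(zip(names, haar_dwt2(donor)))[band]
    return haar_idwt2(subs["LL"], subs["LH"], subs["HL"], subs["HH"])
```

Because the transform is exactly invertible, the augmented image keeps the original content in the untouched bands while the chosen band now carries the donor's statistics.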

[177] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset

Ziye Deng, Ruihan He, Jiaxiang Liu, Yuan Wang, Zijie Meng, Songtao Jiang, Yong Xie, Zuozhu Liu

Main category: cs.CV

TL;DR: Med-GLIP is a modality-aware medical image grounding framework trained on a large-scale dataset (Med-GLIP-5M) with 5.3M region-level annotations across 7 imaging modalities, enabling hierarchical semantic understanding without expert modules and improving performance in grounding, VQA, and report generation tasks.

DetailsMotivation: Existing medical image grounding research faces limitations including limited modality coverage, coarse-grained annotations, and lack of unified generalizable frameworks, hindering intelligent diagnosis and automated medical applications.

Method: Constructed Med-GLIP-5M dataset with hierarchical region labels, then proposed Med-GLIP framework that implicitly learns hierarchical semantic understanding from diverse training data without explicitly designed expert modules.

Result: Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks and provides substantial performance gains when integrated into downstream tasks like medical VQA and report generation.

Conclusion: The proposed Med-GLIP framework and Med-GLIP-5M dataset effectively address limitations in medical image grounding, enabling hierarchical semantic understanding and improving performance across various medical AI applications.

Abstract: Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data – enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.

[178] MCTED: A Machine-Learning-Ready Dataset for Digital Elevation Model Generation From Mars Imagery

Rafał Osadnik, Pablo Gómez, Eleni Bohacek, Rickbir Bahia

Main category: cs.CV

TL;DR: This paper introduces MCTED, a new dataset for Martian digital elevation model prediction using Mars Reconnaissance Orbiter CTX data, consisting of 80,898 samples with optical image patches, DEM patches, and mask patches to handle missing data.

DetailsMotivation: To address the challenges of artefacts and missing data in large-scale Martian DEMs and provide a high-quality dataset specifically designed for machine learning applications in Martian terrain analysis.

Method: Developed a comprehensive pipeline to process high-resolution Mars orthoimage and DEM pairs, created tools to mitigate data artefacts, and organized samples into training/validation splits without mutual area coverage to prevent data leakage.

Result: Generated a dataset of 80,898 samples covering diverse Martian terrain, with statistical insights on spatial distribution, elevation values, and slopes. A small U-Net trained on MCTED outperformed DepthAnythingV2’s zero-shot performance on elevation prediction.

Conclusion: Specialized training on domain-specific datasets like MCTED yields better performance than general foundation models, and the dataset provides valuable resources for Martian terrain analysis with complete open-source availability.

Abstract: This work presents a new dataset for the Martian digital elevation model prediction task, ready for machine learning applications called MCTED. The dataset has been generated using a comprehensive pipeline designed to process high-resolution Mars orthoimage and DEM pairs from Day et al., yielding a dataset consisting of 80,898 data samples. The source images are data gathered by the Mars Reconnaissance Orbiter using the CTX instrument, providing a very diverse and comprehensive coverage of the Martian surface. Given the complexity of the processing pipelines used in large-scale DEMs, there are often artefacts and missing data points in the original data, for which we developed tools to solve or mitigate their impact. We divide the processed samples into training and validation splits, ensuring samples in both splits cover no mutual areas to avoid data leakage. Every sample in the dataset is represented by the optical image patch, DEM patch, and two mask patches, indicating values that were originally missing or were altered by us. This allows future users of the dataset to handle altered elevation regions as they please. We provide statistical insights of the generated dataset, including the spatial distribution of samples, the distributions of elevation values, slopes and more. Finally, we train a small U-Net architecture on the MCTED dataset and compare its performance to a monocular depth estimation foundation model, DepthAnythingV2, on the task of elevation prediction. We find that even a very small architecture trained on this dataset specifically, beats a zero-shot performance of a depth estimation foundation model like DepthAnythingV2. We make the dataset and code used for its generation completely open source in public repositories.
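The leakage-free split described above (train and validation never cover mutual areas) can be sketched by grouping every patch under its source orthoimage and assigning whole groups to one side. The `source_id` field and the deterministic assignment are placeholder assumptions for illustration.

```python
from collections import defaultdict

def spatial_split(samples, val_frac=0.1):
    """Split patch samples so train/val never share a source image,
    preventing spatial leakage. Real pipelines would shuffle group ids;
    this sketch assigns them deterministically for reproducibility."""
    groups = defaultdict(list)
    for s in samples:
        groups[s["source_id"]].append(s)   # hypothetical field name
    ids = sorted(groups)
    n_val = max(1, int(len(ids) * val_frac))
    val_ids = set(ids[:n_val])
    train = [s for i in ids if i not in val_ids for s in groups[i]]
    val = [s for i in val_ids for s in groups[i]]
    return train, val
```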

[179] Zero-Shot Referring Expression Comprehension via Vision-Language True/False Verification

Jeffrey Liu, Rongbin Hu

Main category: cs.CV

TL;DR: A zero-shot workflow for Referring Expression Comprehension that reformulates the task as box-wise visual-language verification achieves competitive performance without any REC-specific training.

DetailsMotivation: To show that workflow design rather than task-specific pretraining can drive strong zero-shot performance in REC, reducing the need for specialized training.

Method: Reformulates REC as box-wise visual-language verification using proposals from a generic detector (YOLO-World), where a general-purpose VLM independently answers True/False queries for each region.

Result: Surpasses zero-shot GroundingDINO baseline and exceeds reported results for trained GroundingDINO variants on RefCOCO, RefCOCO+, and RefCOCOg datasets.

Conclusion: Workflow design is more important than task-specific pretraining for achieving strong zero-shot REC performance, with verification-based approach outperforming selection-based prompting.

Abstract: Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
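The box-wise verification workflow is simple to sketch: each proposal is judged independently, and returning a possibly-empty list is what gives the method abstention and multiple matches for free. `vlm_answer` below is a hypothetical stand-in for the real True/False VLM query.

```python
def verify_boxes(boxes, expression, vlm_answer):
    """Query the VLM independently per proposal box and keep every box it
    verifies. Independent queries avoid cross-box interference; an empty
    result is a natural abstention."""
    return [box for box in boxes if vlm_answer(box, expression)]
```

In the paper the proposals come from YOLO-World and the verifier is a general-purpose VLM; here any callable with the same True/False contract works.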

[180] Comparing Computational Pathology Foundation Models using Representational Similarity Analysis

Vaibhav Mishra, William Lotter

Main category: cs.CV

TL;DR: Analysis of six computational pathology foundation models reveals distinct representational structures, with UNI2 and Virchow2 being most unique, high slide-dependence but low disease-dependence, and stain normalization reducing slide-specific variability.

DetailsMotivation: To systematically analyze the representational spaces of computational pathology foundation models, as while task performance has been evaluated, less is known about the structure and variability of their learned representations.

Method: Used representational similarity analysis on H&E image patches from TCGA, comparing six models spanning vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI v2, Virchow v2, Prov-GigaPath) approaches.

Result: UNI2 and Virchow2 had most distinct representations; Prov-GigaPath had highest average similarity; same training paradigm didn’t guarantee higher similarity; high slide-dependence but low disease-dependence; stain normalization reduced slide-dependence by 5.5-20.5%; vision-language models had more compact representations than vision-only models.

Conclusion: The findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations, with extendable framework across medical imaging domains.

Abstract: Foundation models are increasingly developed in computational pathology (CPath) given their promise in facilitating many downstream tasks. While recent studies have evaluated task performance across models, less is known about the structure and variability of their learned representations. Here, we systematically analyze the representational spaces of six CPath foundation models using techniques popularized in computational neuroscience. The models analyzed span vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI (v2), Virchow (v2), Prov-GigaPath) approaches. Through representational similarity analysis using H&E image patches from TCGA, we find that UNI2 and Virchow2 have the most distinct representational structures, whereas Prov-Gigapath has the highest average similarity across models. Having the same training paradigm (vision-only vs. vision-language) did not guarantee higher representational similarity. The representations of all models showed a high slide-dependence, but relatively low disease-dependence. Stain normalization decreased slide-dependence for all models by a range of 5.5% (CONCH) to 20.5% (PLIP). In terms of intrinsic dimensionality, vision-language models demonstrated relatively compact representations, compared to the more distributed representations of vision-only models. These findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations. Our framework is extendable across medical imaging domains, where probing the internal representations of foundation models can support their effective development and deployment.
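Representational similarity analysis of the kind used here can be sketched as: build each model's representational dissimilarity matrix (RDM) over the same patches, then correlate the two RDMs' upper triangles. The 1-minus-Pearson dissimilarity and Spearman comparison below are common RSA choices assumed for illustration, not necessarily the paper's exact variants.

```python
import numpy as np

def rdm(features):
    # representational dissimilarity matrix: 1 - Pearson corr between rows
    z = (features - features.mean(1, keepdims=True)) / features.std(1, keepdims=True)
    return 1.0 - (z @ z.T) / features.shape[1]

def rsa_score(feat_a, feat_b):
    """Spearman correlation between two models' RDM upper triangles
    (ties ignored for brevity). feat_*: (n_patches, dim) embeddings."""
    iu = np.triu_indices(feat_a.shape[0], k=1)
    ra, rb = rdm(feat_a)[iu], rdm(feat_b)[iu]
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    ra, rb = rank(ra), rank(rb)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

Comparing RDMs rather than raw embeddings is what lets models with different dimensionalities (vision-only vs. vision-language) be placed on a common footing.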

[181] Hemorica: A Comprehensive CT Scan Dataset for Automated Brain Hemorrhage Classification, Segmentation, and Detection

Kasra Davoodi, Mohammad Hoseyni, Javad Khoramdel, Reza Barati, Reihaneh Mortazavi, Amirhossein Nikoofard, Mahdi Aliyari-Shoorehdeli, Jaber Hatam Parikhan

Main category: cs.CV

TL;DR: Hemorica is a publicly available dataset of 372 head CT scans with comprehensive annotations for five intracranial hemorrhage subtypes, enabling robust AI model development for ICH detection and segmentation.

DetailsMotivation: To address the fragmented public data issue hindering robust AI solutions for timely intracranial hemorrhage diagnosis on CT scans.

Method: Created Hemorica dataset with 372 head CT exams (2012-2024) annotated for five ICH subtypes using double-reading workflow with neurosurgeon adjudication. Fine-tuned standard CNN and transformer architectures for binary classification and segmentation tasks.

Result: Lightweight models achieved 87.8% F1 score for binary classification and 85.5% Dice score for lesion segmentation, validating annotation quality and sufficient sample size.

Conclusion: Hemorica provides a unified benchmark supporting multi-task learning, transfer to weakly labeled cohorts, and facilitates AI assistant development for ICH detection and quantification.

Abstract: Timely diagnosis of Intracranial hemorrhage (ICH) on Computed Tomography (CT) scans remains a clinical priority, yet the development of robust Artificial Intelligence (AI) solutions is still hindered by fragmented public data. To close this gap, we introduce Hemorica, a publicly available collection of 372 head CT examinations acquired between 2012 and 2024. Each scan has been exhaustively annotated for five ICH subtypes: epidural (EPH), subdural (SDH), subarachnoid (SAH), intraparenchymal (IPH), and intraventricular (IVH). The annotations yield patient-wise and slice-wise classification labels, subtype-specific bounding boxes, two-dimensional pixel masks and three-dimensional voxel masks. A double-reading workflow, preceded by a pilot consensus phase and supported by neurosurgeon adjudication, maintained low inter-rater variability. Comprehensive statistical analysis confirms the clinical realism of the dataset. To establish reference baselines, standard convolutional and transformer architectures were fine-tuned for binary slice classification and hemorrhage segmentation. With only minimal fine-tuning, lightweight models such as MobileViT-XS achieved an F1 score of 87.8% in binary classification, whereas a U-Net with a DenseNet161 encoder reached a Dice score of 85.5% for binary lesion segmentation, results that validate both the quality of the annotations and the sufficiency of the sample size. Hemorica therefore offers a unified, fine-grained benchmark that supports multi-task and curriculum learning, facilitates transfer to larger but weakly labelled cohorts, and eases the design of AI-based assistants for ICH detection and quantification.
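For reference, the Dice score reported for the segmentation baseline measures overlap between predicted and ground-truth masks, 2|A∩B| / (|A| + |B|):

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient for binary lesion masks. `eps` keeps the score
    defined when both masks are empty."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```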

[182] FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao

Main category: cs.CV

TL;DR: FlexAC enables flexible control over associative reasoning in MLLMs by modulating middle layer representations using hallucination-guided steering vectors, achieving significant improvements in creativity while reducing hallucinations.

DetailsMotivation: MLLMs face a trade-off between faithfulness and creativity, but existing methods lack flexibility to modulate associative reasoning strength for different tasks requiring varying degrees of associative reasoning.

Method: FlexAC uses hallucination-guided intermediate representations to encode associative directions, constructs associative steering vectors from high-association instances, and incorporates task-specific vectors from target-domain samples to enable multi-dimensional associative reasoning.

Result: Achieves up to 5.8x improvement in creativity on Creation-MMBench and 29% reduction in hallucination rate on CHAIR, surpassing existing baselines.

Conclusion: FlexAC provides an effective training-free framework for flexible control over associative reasoning in MLLMs, enabling better adaptation to both factual and creative scenarios.

Abstract: Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs’ adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model’s associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
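Steering vectors of the general kind FlexAC builds can be sketched as a mean difference between hidden states from high-association instances and a baseline, added back into a middle-layer representation with a tunable strength. The mean-difference construction is a common assumption here; FlexAC's hallucination-guided derivation and adaptive calibration are more involved.

```python
import numpy as np

def steering_vector(high_assoc_acts, baseline_acts):
    # direction from baseline toward high-association hidden states
    # (each input: (n_samples, hidden_dim) activations from one layer)
    return high_assoc_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def apply_steering(hidden, vec, strength):
    # add the direction at inference time; strength trades creative
    # guidance against output stability
    return hidden + strength * vec
```

A positive `strength` pushes the model toward associative behavior, a negative one toward faithfulness, which is the flexible-control knob the paper is after.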

[183] RealDPO: Real or Not Real, that is the Preference

Guo Cheng, Danni Yang, Ziqi Huang, Jianlou Si, Chenyang Si, Ziwei Liu

Main category: cs.CV

TL;DR: RealDPO introduces a new alignment method using real-world videos as positive samples for preference learning to improve complex motion synthesis in video generation, outperforming current state-of-the-art models.

DetailsMotivation: Existing video generative models struggle with producing natural, smooth, and contextually consistent complex motions, limiting their practical applications in real-world scenarios.

Method: RealDPO uses Direct Preference Optimization (DPO) with a tailored loss function that contrasts real-world videos with erroneous model outputs, enabling iterative self-correction. Also introduces RealAction-5K dataset for post-training.

Result: Extensive experiments show RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.

Conclusion: RealDPO effectively addresses the motion synthesis challenge in video generation through real-world data-driven preference learning, enabling more accurate and realistic motion generation.

Abstract: Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
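The standard DPO objective that RealDPO builds on scores how much more the policy prefers the winner (here, a real video) over the loser (an erroneous generation), relative to a frozen reference model; the paper's tailored loss is described only at this level, so the plain form below is an assumption.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy's preference for the chosen sample over the rejected one,
    measured against the reference model's log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.log1p(np.exp(-margin))  # == -log sigmoid(margin)
```

At zero margin the loss is log 2; it falls as the policy's preference for the real video grows, which is the corrective signal SFT lacks.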

[184] Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding

Zhuoming Li, Aitong Liu, Mengxi Jia, Yubi Lu, Tengxiang Zhang, Changzhi Sun, Dell Zhang, Xuelong Li

Main category: cs.CV

TL;DR: Gestura is an end-to-end system that improves free-form gesture understanding using a pre-trained LVLM enhanced with anatomical hand priors and CoT reasoning, achieving better accuracy and response times than existing solutions.

DetailsMotivation: Existing solutions like GestureGPT have limited recognition accuracy and slow response times for free-form gesture understanding, which is important for natural human-computer interaction.

Method: Uses pre-trained LVLM aligned with gesture semantics, adds Landmark Processing Module with anatomical hand priors for fine-grained movement capture, and implements Chain-of-Thought reasoning for step-by-step semantic inference.

Result: Gestura achieves robust and adaptable free-form gesture comprehension, and the team created the first open-source dataset with over 300,000 annotated QA pairs for gesture intention reasoning.

Conclusion: The combination of LVLM alignment, anatomical hand priors, and CoT reasoning enables significant improvement in interpreting ambiguous or unconventional gestures for free-form gesture understanding.

Abstract: Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution GestureGPT suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different styles, we introduce a Landmark Processing Module that compensates for LVLMs’ lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic understanding and significantly enhancing the model’s ability to interpret ambiguous or unconventional gestures. Together, these components allow Gestura to achieve robust and adaptable free-form gesture comprehension. Additionally, we have developed the first open-source dataset for free-form gesture intention reasoning and understanding with over 300,000 annotated QA pairs.

[185] Residual Diffusion Bridge Model for Image Restoration

Hebaixu Wang, Jing Zhang, Haoyang Chen, Haonan Guo, Di Wang, Jiayi Ma, Bo Du

Main category: cs.CV

TL;DR: RDBM is a unified diffusion bridge model that uses residual-based modulation for adaptive image restoration, outperforming existing methods by preserving undegraded regions while restoring degraded areas.

DetailsMotivation: Existing diffusion bridge models lack unified analysis and indiscriminately process images, distorting undegraded regions due to global noise injection and removal.

Method: Theoretical reformulation of stochastic differential equations for generalized diffusion bridges, with residual-based modulation to adaptively inject/remove noise only in degraded regions.

Result: RDBM achieves state-of-the-art performance across diverse image restoration tasks, with analytical proofs showing existing bridge models are special cases of RDBM.

Conclusion: RDBM provides a unified framework for diffusion bridge models with adaptive restoration capabilities, demonstrating optimal performance and broad applicability in image restoration.

Abstract: Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact others. Moreover, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at https://github.com/MiliLab/RDBM.
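The residual-modulation idea, injecting noise only where the degraded input differs from the target so intact regions pass through unchanged, might be sketched roughly as below. This is a loose illustration of the gating idea only, not RDBM's actual stochastic differential equation.

```python
import numpy as np

def residual_modulated_noising(x_degraded, x_target, t, rng):
    """Loose sketch of a residual-gated forward process: the per-pixel
    residual scales both the bridge interpolation and the noise, so pixels
    with zero residual (undegraded regions) are left exactly intact.
    Illustrative only; not the paper's derived SDE."""
    residual = x_degraded - x_target
    gate = np.abs(residual) / (np.abs(residual).max() + 1e-8)
    noise = rng.standard_normal(x_degraded.shape)
    # interpolate target -> degraded over t in [0, 1], bridge-style noise
    return x_target + t * residual + np.sqrt(t * (1.0 - t)) * gate * noise
```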

[186] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation

Wei Shang, Wanying Zhang, Shuhang Gu, Pengfei Zhu, Qinghua Hu, Dongwei Ren

Main category: cs.CV

TL;DR: BasicAVSR is a strong baseline for arbitrary-scale video super-resolution that integrates adaptive multi-scale frequency priors, flow-guided propagation, second-order motion compensation, and hyper-upsampling to achieve superior performance across different scenarios.

DetailsMotivation: Arbitrary-scale video super-resolution faces challenges in spatial detail reproduction, temporal consistency, and computational complexity, requiring a flexible solution that can handle various scaling factors and application scenarios.

Method: Proposed BasicAVSR integrates four key components: 1) adaptive multi-scale frequency priors from image Laplacian pyramids, 2) flow-guided propagation unit for spatiotemporal aggregation, 3) second-order motion compensation for accurate frame alignment, and 4) hyper-upsampling unit for scale-aware upsampling kernels. Three propagation variants are instantiated for different use cases.

Result: Experimental results demonstrate BasicAVSR significantly outperforms existing methods in super-resolution quality, generalization ability, and inference speed across different scenarios.

Conclusion: BasicAVSR advances the state-of-the-art in arbitrary-scale video super-resolution and extends its core components to multiple frameworks for diverse application scenarios.

Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
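The Laplacian-pyramid frequency priors in component 1) can be sketched with a minimal pyramid (2x2 mean downsampling and nearest-neighbor upsampling are simplifying assumptions; BasicAVSR's actual filters are not specified here):

```python
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Decompose an image into band-pass detail layers plus a
    low-frequency residual; each detail layer is a frequency prior
    at one scale."""
    pyr, cur = [], img.astype(float)
    for _ in range(levels):
        down = (cur[0::2, 0::2] + cur[0::2, 1::2] +
                cur[1::2, 0::2] + cur[1::2, 1::2]) / 4.0
        up = np.repeat(np.repeat(down, 2, axis=0), 2, axis=1)
        pyr.append(cur - up)   # band-pass detail at this scale
        cur = down
    pyr.append(cur)            # low-frequency residual
    return pyr

def reconstruct(pyr):
    # exact inverse: upsample and add details coarse-to-fine
    cur = pyr[-1]
    for detail in reversed(pyr[:-1]):
        cur = np.repeat(np.repeat(cur, 2, axis=0), 2, axis=1) + detail
    return cur
```

The decomposition is exactly invertible, so feeding the detail layers to the network adds multi-scale frequency information without losing any content.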

[187] OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation

Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu

Main category: cs.CV

TL;DR: OmniVLA is an omni-modality vision-language-action model that integrates multiple sensing modalities beyond RGB, achieving 84% task success rate and significantly outperforming RGB-only and raw-sensor baselines.

DetailsMotivation: Most existing VLA models rely solely on RGB cameras, limiting perception and manipulation capabilities. The authors aim to enhance robotic action prediction by incorporating novel sensing modalities for physically-grounded spatial intelligence.

Method: Uses sensor-masked images - a unified representation overlaying physically meaningful masks from infrared camera, mmWave radar, and microphone array onto RGB images. Built on RGB-pretrained VLA backbone with lightweight per-sensor projectors for data-efficient learning.

Result: Achieves 84% average task success rate, outperforming RGB-only baselines by 59% and raw-sensor-input baselines by 28%. Shows higher learning efficiency and stronger generalization capability.

Conclusion: OmniVLA demonstrates that integrating multiple sensing modalities through sensor-masked images significantly enhances robotic manipulation performance while maintaining training efficiency and generalization.

Abstract: Vision-language-action (VLA) models have shown strong generalization for robotic action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this, we present a multisensory vision-language-action model architecture and train the model based on an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception guides the robotic manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming both RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, while also showing higher learning efficiency and stronger generalization capability.
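The sensor-masked image idea (overlaying a physically meaningful mask onto the RGB frame while staying image-native) can be sketched as a simple alpha blend; the tint color, blend weight, and mask source below are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def sensor_masked_image(rgb, mask, color=(255, 0, 0), alpha=0.5):
    """Overlay a spatially grounded sensor mask onto an RGB frame.

    rgb:  (H, W, 3) uint8 image
    mask: (H, W) boolean array marking pixels flagged by a sensor
    Blending a tint into masked pixels keeps the result close to RGB
    statistics, which is the motivation for the image-native format.
    """
    out = rgb.astype(np.float64)
    tint = np.array(color, dtype=np.float64)
    out[mask] = (1 - alpha) * out[mask] + alpha * tint
    return out.astype(np.uint8)

rgb = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # e.g. a region flagged as hot by an infrared camera
out = sensor_masked_image(rgb, mask)
print(out[1, 1], out[0, 0])  # masked pixel tinted, unmasked pixel unchanged
```

Because the output is still an ordinary image, the same RGB-pretrained backbone can consume it with only a lightweight per-sensor projector, as the abstract describes.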

[188] Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop

YoungJae Cheong, Jhonghyun An

Main category: cs.CV

TL;DR: A Light Geometry-aware adapter improves LiDAR semantic segmentation robustness in adverse weather by aligning azimuth, applying circular padding, and using local geometry cues for region-aware regularization.

DetailsMotivation: LiDAR semantic segmentation degrades in adverse weather due to geometry corruption from refraction, scattering, and point dropouts. Prior methods overlook structural vulnerabilities near boundaries, corners, and sparse regions.

Method: Proposes a plug-and-play adapter that aligns azimuth, applies horizontal circular padding, uses local-window K-Nearest Neighbors to compute geometry statistics, and drives region-aware regularization during training only.

Result: Improves mIoU by 7.9 percentage points over data-centric augmentation baseline and by 0.6 points over class-centric regularization baseline in cross-weather evaluation on SemanticSTF.

Conclusion: Geometry-driven regularization is a key direction for all-weather LiDAR segmentation, with the adapter providing significant robustness improvements at negligible inference cost.

Abstract: LiDAR semantic segmentation degrades in adverse weather because refraction, scattering, and point dropouts corrupt geometry. Prior work in weather simulation, mixing-based augmentation, domain randomization, and uncertainty or boundary regularization improves robustness but still overlooks structural vulnerabilities near boundaries, corners, and sparse regions. We present a Light Geometry-aware adapter. The module aligns azimuth and applies horizontal circular padding to preserve neighbor continuity across the 0~360 degree wrap-around boundary. A local-window K-Nearest Neighbors gathers nearby points and computes simple local statistics, which are compressed into compact geometry-aware cues. During training, these cues drive region-aware regularization that stabilizes predictions in structurally fragile areas. The adapter is plug and play, complements augmentation, and can be enabled only during training with negligible inference cost. We adopt a source-only cross-weather setup where models train on SemanticKITTI and are evaluated on SemanticSTF without target labels or fine-tuning. The adapter improves mIoU by 7.9 percentage points over the data-centric augmentation baseline and by 0.6 points over the class-centric regularization baseline. These results indicate that geometry-driven regularization is a key direction for all-weather LiDAR segmentation.
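Two of the adapter's ingredients, horizontal circular padding across the azimuth wrap-around and local KNN geometry statistics, can be sketched in NumPy. The KNN here is brute force over a toy point set; the paper's windowed implementation is not reproduced:

```python
import numpy as np

def circular_pad_azimuth(range_image, pad=2):
    """Horizontally wrap a (H, W) LiDAR range image so downstream
    convolutions see continuous neighbors across the 0-360 degree seam."""
    left = range_image[:, -pad:]
    right = range_image[:, :pad]
    return np.concatenate([left, range_image, right], axis=1)

def local_knn_stats(points, k=4):
    """Per-point mean distance to the k nearest neighbors (brute force):
    a simple geometry cue that grows large in sparse or fragile regions."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude each point from its own neighbors
    knn = np.sort(d, axis=1)[:, :k]
    return knn.mean(axis=1)

img = np.arange(12, dtype=float).reshape(3, 4)
padded = circular_pad_azimuth(img, pad=1)
print(padded.shape)  # (3, 6); first padded column equals the last column of img

pts = np.random.rand(50, 3)
cues = local_knn_stats(pts, k=4)  # one scalar cue per point
print(cues.shape)
```

Cues like these can then weight a region-aware regularization term during training, leaving inference untouched, which is why the adapter's runtime cost is negligible.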

[189] TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, Kaipeng Zhang

Main category: cs.CV

TL;DR: TIR-Bench is a new benchmark for evaluating agentic thinking-with-images capabilities across 13 diverse tasks requiring novel tool use for image processing in chain-of-thought reasoning.

DetailsMotivation: Existing benchmarks like Visual Search fail to capture advanced thinking-with-images capabilities, testing only basic operations and offering little insight into complex, dynamic, tool-dependent reasoning.

Method: Introduced TIR-Bench benchmark with 13 diverse tasks requiring novel tool use for image processing and manipulation in chain-of-thought. Evaluated 22 multimodal LLMs including open-source, proprietary, and tool-use augmented models.

Result: TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Results show current models struggle with advanced visual reasoning tasks.

Conclusion: The benchmark effectively measures agentic thinking-with-images capabilities, and a pilot study compares direct versus agentic fine-tuning approaches.

Abstract: The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-source and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.

[190] Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users

Saurabh Kaushik, Lalit Maurya, Elizabeth Tellman, ZhiJie Zhang

Main category: cs.CV

TL;DR: Geo-Foundational Models (GFMs) show competitive performance for flood mapping across different satellite sensors, with Clay emerging as the best overall performer due to better accuracy, computational efficiency, and few-shot learning capabilities compared to traditional models like U-Net.

DetailsMotivation: There is a lack of systematic comparison between GFMs and traditional models for flood inundation mapping across different sensors and data availability scenarios, which is essential to guide end-users in model selection.

Method: Evaluated three GFMs (Prithvi 2.0, Clay V1.5, DOFA) and UViT against traditional models (TransNorm, U-Net, Attention U-Net) using PlanetScope, Sentinel-1, and Sentinel-2 data. Conducted leave-one-region-out cross-validation across five regions and few-shot experiments.

Result: GFMs show competitive performance with only 2-5% variation between best and worst models. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). Clay shows 4% improvement over U-Net and superior few-shot performance (0.64 mIoU with 5 images vs 0.24 for Prithvi). Clay is computationally efficient with 26M parameters, making it 3x faster than Prithvi.

Conclusion: GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net, with Clay being the most practical choice due to its balanced performance, efficiency, and few-shot learning capabilities.

Abstract: Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate three GFMs (Prithvi 2.0, Clay V1.5, and DOFA) and UViT (a Prithvi variant) against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay's superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
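All of the comparisons above are reported in mIoU. For reference, a minimal implementation of the metric on a toy binary flood mask (class 1 = water):

```python
import numpy as np

def miou(pred, target, num_classes=2):
    """Mean intersection-over-union across classes, averaging only
    over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred   = np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0]])
target = np.array([[1, 1, 1, 0],
                   [1, 0, 0, 0]])
print(round(miou(pred, target), 3))  # 0.775
```

A 0.79 vs 0.70 mIoU gap (Clay on PlanetScope vs Sentinel-2) therefore reflects overlap quality averaged over both the water and non-water classes.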

[191] SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

Main category: cs.CV

TL;DR: SurgViVQA is a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes using temporal cues, outperforming existing methods on colonoscopic datasets.

DetailsMotivation: Current surgical VideoQA approaches are limited to static image features and lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation in surgical videos.

Method: Uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues like motion and tool-tissue interactions, then decodes with a fine-tuned LLM. Evaluated on REAL-Colon-VQA dataset with motion-related questions and out-of-template variations.

Result: Outperforms existing image-based VQA models, improving keyword accuracy by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. Perturbation study confirms improved generalizability and robustness to question phrasing variations.

Conclusion: SurgViVQA provides a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively.

Abstract: Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video–Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool–tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

cs.AI

[192] Scaling Agent Learning via Experience Synthesis

Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh

Main category: cs.AI

TL;DR: DreamGym is a unified framework that synthesizes diverse experiences for scalable online RL training of autonomous agents, using reasoning-based experience models instead of expensive real rollouts.

DetailsMotivation: Practical adoption of RL for LLM agents is challenging due to costly rollouts, limited task diversity, unreliable rewards, and infrastructure complexity that obstruct scalable experience data collection.

Method: DreamGym distills environment dynamics into reasoning-based experience models for consistent state transitions, uses experience replay buffers initialized with offline data, and adaptively generates challenging tasks for curriculum learning.

Result: DreamGym substantially improves RL training, outperforms baselines by over 30% on non-RL-ready tasks like WebArena, matches GRPO and PPO performance using only synthetic interactions, and provides significant performance gains with fewer real-world interactions in sim-to-real transfer.

Conclusion: DreamGym enables scalable warm-start strategy for general-purpose RL by synthesizing diverse experiences, reducing dependency on costly real-environment rollouts while maintaining or improving performance.

Abstract: While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.

[193] How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder

Main category: cs.AI

TL;DR: This paper evaluates NLP tokenization models for assembly code analysis, examining intrinsic properties like vocabulary size and semantic coverage, and their impact on downstream tasks like function signature prediction.

DetailsMotivation: Tokenization is fundamental in assembly code analysis but remains underexplored, with significant impact on vocabulary size, semantic coverage, and downstream task performance.

Method: Systematic study of various tokenization models with intrinsic evaluations of tokenization efficiency, vocabulary compression, and representational fidelity. Uses pre-trained models (Llama 3.2, BERT, BART) to evaluate tokenizer effectiveness across multiple performance metrics.

Result: Tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. Complex trade-offs exist between intrinsic tokenizer properties and practical utility.

Conclusion: The study provides valuable insights for optimizing tokenization models in low-level code analysis, contributing to more robust and scalable NLM-based binary analysis workflows.

Abstract: Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction – a critical problem in binary code analysis. To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
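As a toy illustration of the kind of pre-tokenization rule the study explores, one might normalize hex immediates before subword training so literal addresses do not inflate the vocabulary. The rule below is hypothetical, not the paper's exact scheme:

```python
import re

def pretokenize_asm(instr):
    """Split an assembly instruction into mnemonic and operand tokens,
    replacing hex immediates with a placeholder so that the vocabulary
    does not contain one token per literal address."""
    mnemonic, _, operands = instr.strip().partition(" ")
    tokens = [mnemonic]
    for op in filter(None, (o.strip() for o in operands.split(","))):
        op = re.sub(r"0x[0-9a-fA-F]+", "<IMM>", op)
        tokens.append(op)
    return tokens

print(pretokenize_asm("mov eax, 0x4010a0"))  # ['mov', 'eax', '<IMM>']
print(pretokenize_asm("ret"))                # ['ret']
```

Choices like this trade representational fidelity (the exact address is lost) for vocabulary compression, which is precisely the intrinsic trade-off the study measures against downstream tasks such as function signature prediction.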

[194] To See or To Read: User Behavior Reasoning in Multimodal LLMs

Tianning Dong, Luyi Ma, Varun Vasudevan, Jason Cho, Sushant Kumar, Kannan Achan

Main category: cs.AI

TL;DR: BehaviorLens framework compares text vs image representations of user behavior data for MLLMs, finding image representations improve next-purchase prediction accuracy by 87.5% over text.

DetailsMotivation: To determine whether textual or image representations of user behavior data are more effective for maximizing Multimodal Large Language Model (MLLM) performance in reasoning over sequential user-behavior data.

Method: Developed BehaviorLens benchmarking framework to assess modality trade-offs across six MLLMs by representing transaction data as text paragraphs, scatter plots, and flowcharts using a real-world purchase-sequence dataset.

Result: When data is represented as images, MLLMs' next-purchase prediction accuracy improved by 87.5% compared to the equivalent textual representation, without any additional computational cost.

Conclusion: Image representations of user behavior data significantly outperform text representations for MLLMs in next-purchase prediction tasks, demonstrating the importance of modality selection in user-behavior reasoning.

Abstract: Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation, without any additional computational cost.
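Two of the three representations can be sketched for a toy purchase sequence; the serialization formats below are illustrative assumptions, and an image renderer would turn the flowchart string into the actual picture fed to the MLLM:

```python
def as_paragraph(seq):
    """Serialize a purchase sequence as a text paragraph (modality 1)."""
    steps = ", then ".join(f"bought {item} on day {day}" for day, item in seq)
    return f"The user {steps}."

def as_flowchart(seq):
    """Serialize the same sequence as a flowchart-style chain (modality 3);
    shown as text here, rendered to an image in the benchmark."""
    return " -> ".join(item for _, item in seq)

history = [(1, "laptop"), (3, "mouse"), (7, "keyboard")]
print(as_paragraph(history))
print(as_flowchart(history))  # laptop -> mouse -> keyboard
```

The benchmark's question is whether an MLLM predicts the next purchase better from the rendered image of the second form than from the first form as plain text.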

[195] Cross-modal Causal Intervention for Alzheimer’s Disease Prediction

Yutao Jin, Haowen Xiao, Junyong Zhai, Yuxiao Li, Jielei Chu, Fengmao Lv

Main category: cs.AI

TL;DR: Proposes MediAD, a visual-language causality-inspired framework using LLMs and multi-modal data for Alzheimer’s Disease diagnosis that addresses confounders through causal intervention.

DetailsMotivation: Early MCI identification is crucial for slowing AD progression, but diagnosis is challenging due to confounders from multi-modal data selection bias and complex variable relationships.

Method: Uses LLMs to summarize clinical data under strict templates, then employs MRI, clinical data, and enriched textual data with unified causal intervention to mitigate confounder effects.

Result: Outperforms other methods in most evaluation metrics for distinguishing CN/MCI/AD cases, demonstrating superior diagnostic performance.

Conclusion: Shows potential of integrating causal reasoning with multi-modal learning for neurological disease diagnosis, effectively addressing confounder issues.

Abstract: Mild Cognitive Impairment (MCI) serves as a prodromal stage of Alzheimer’s Disease (AD), where early identification and intervention can effectively slow the progression to dementia. However, diagnosing AD remains a significant challenge in neurology due to the confounders caused mainly by the selection bias of multi-modal data and the complex relationships between variables. To address these issues, we propose a novel visual-language causality-inspired framework named Cross-modal Causal Intervention with Mediator for Alzheimer’s Disease Diagnosis (MediAD) for diagnostic assistance. Our MediAD employs Large Language Models (LLMs) to summarize clinical data under strict templates, therefore enriching textual inputs. The MediAD model utilizes Magnetic Resonance Imaging (MRI), clinical data, and textual data enriched by LLMs to classify participants into Cognitively Normal (CN), MCI, and AD categories. Because of the presence of confounders, such as cerebral vascular lesions and age-related biomarkers, non-causal models are likely to capture spurious input-output correlations, generating less reliable results. Our framework implicitly mitigates the effect of both observable and unobservable confounders through a unified causal intervention method. Experimental results demonstrate the outstanding performance of our method in distinguishing CN/MCI/AD cases, outperforming other methods in most evaluation metrics. The study showcases the potential of integrating causal reasoning with multi-modal learning for neurological disease diagnosis.

[196] KnowThyself: An Agentic Assistant for LLM Interpretability

Suraj Prasai, Mengnan Du, Ying Zhang, Fan Yang

Main category: cs.AI

TL;DR: KnowThyself is a chat-based LLM interpretability tool that consolidates fragmented analysis capabilities into an accessible interface with interactive visualizations and guided explanations.

DetailsMotivation: Existing LLM interpretability tools are fragmented and code-intensive, creating barriers for users who want to understand model behavior without deep technical expertise.

Method: Uses an orchestrator LLM to reformulate user queries, an agent router to direct queries to specialized modules, and contextualizes outputs into coherent explanations within a chat-based interface.

Result: Created a consolidated platform that lowers technical barriers and provides extensible LLM inspection capabilities through natural language interaction.

Conclusion: KnowThyself offers a robust foundation for accessible LLM interpretability by embedding the analysis process into a conversational workflow.

Abstract: We develop KnowThyself, an agentic assistant that advances large language model (LLM) interpretability. Existing tools provide useful insights but remain fragmented and code-intensive. KnowThyself consolidates these capabilities into a chat-based interface, where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router further directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible platform for LLM inspection. By embedding the whole process into a conversational workflow, KnowThyself offers a robust foundation for accessible LLM interpretability.
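The orchestrator-then-router flow can be caricatured with a keyword dispatcher; the module names are hypothetical, and KnowThyself uses an LLM for this step rather than keyword matching:

```python
def route(query):
    """Toy agent router: dispatch a (pre-reformulated) user query to an
    interpretability module. Module names here are illustrative only."""
    modules = {
        "attention": "attention_visualizer",
        "neuron": "neuron_probe",
        "logit": "logit_lens",
    }
    for keyword, module in modules.items():
        if keyword in query.lower():
            return module
    return "general_qa"  # fallback when no specialized module applies

print(route("Show me the attention maps for layer 5"))  # attention_visualizer
print(route("What does this model think?"))             # general_qa
```

In the real system the selected module's output is then contextualized by the orchestrator LLM into a coherent explanation, closing the conversational loop.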

[197] Extracting Causal Relations in Deep Knowledge Tracing

Kevin Hong, Kia Karbasi, Gregory Pottie

Main category: cs.AI

TL;DR: DKT’s effectiveness comes from modeling prerequisite relationships as causal structures, not bidirectional relationships between knowledge components.

DetailsMotivation: To challenge the prevailing explanation that DKT's performance gains stem from bidirectional relationships between KCs and demonstrate its actual strength lies in modeling causal prerequisite relationships.

Method: Pruned exercise relation graphs into Directed Acyclic Graphs (DAGs) and trained DKT on causal subsets of the Assistments dataset. Also proposed an alternative method for extracting exercise relation DAGs using DKT’s learned representations.

Result: DKT’s predictive capabilities align strongly with causal structures, and empirical evidence supports that DKT approximates causal dependencies between KCs rather than simple relational mappings.

Conclusion: DKT’s effectiveness is largely driven by its capacity to model causal dependencies between knowledge components as prerequisite relationships, not bidirectional relationships.

Abstract: A longstanding goal in computational educational research is to develop explainable knowledge tracing (KT) models. Deep Knowledge Tracing (DKT), which leverages a Recurrent Neural Network (RNN) to predict student knowledge and performance on exercises, has been proposed as a major advancement over traditional KT methods. Several studies suggest that its performance gains stem from its ability to model bidirectional relationships between different knowledge components (KCs) within a course, enabling the inference of a student’s understanding of one KC from their performance on others. In this paper, we challenge this prevailing explanation and demonstrate that DKT’s strength lies in its implicit ability to model prerequisite relationships as a causal structure, rather than bidirectional relationships. By pruning exercise relation graphs into Directed Acyclic Graphs (DAGs) and training DKT on causal subsets of the Assistments dataset, we show that DKT’s predictive capabilities align strongly with these causal structures. Furthermore, we propose an alternative method for extracting exercise relation DAGs using DKT’s learned representations and provide empirical evidence supporting our claim. Our findings suggest that DKT’s effectiveness is largely driven by its capacity to approximate causal dependencies between KCs rather than simple relational mappings.

[198] LLMs and Cultural Values: the Impact of Prompt Language and Explicit Cultural Framing

Bram Bulté, Ayla Rigouts Terryn

Main category: cs.AI

TL;DR: LLMs show cultural bias despite prompt variations - they respond to language and cultural framing but remain anchored to values of Netherlands, Germany, US, and Japan, failing to adequately represent global cultural diversity.

DetailsMotivation: To examine whether LLMs can represent cultural diversity given imbalances in training data and optimization objectives, and how prompt language and cultural framing influence model responses across different countries.

Method: Probed 10 LLMs with 63 items from Hofstede Values Survey Module and World Values Survey, translated into 11 languages, formulated as prompts with and without different explicit cultural perspectives.

Result: Prompt language and cultural perspective produce variation in LLM outputs, but models show systematic bias toward values of Netherlands, Germany, US, and Japan. Models produce neutral responses with selective progressive stances on social tolerance. Cultural framing improves alignment more than targeted prompt language.

Conclusion: LLMs occupy an uncomfortable middle ground - responsive enough to produce variation but too anchored to specific cultural defaults to adequately represent cultural diversity.

Abstract: Large Language Models (LLMs) are rapidly being adopted by users across the globe, who interact with them in a diverse range of languages. At the same time, there are well-documented imbalances in the training data and optimisation objectives of this technology, raising doubts as to whether LLMs can represent the cultural diversity of their broad user base. In this study, we look at LLMs and cultural values and examine how prompt language and cultural framing influence model responses and their alignment with human values in different countries. We probe 10 LLMs with 63 items from the Hofstede Values Survey Module and World Values Survey, translated into 11 languages, and formulated as prompts with and without different explicit cultural perspectives. Our study confirms that both prompt language and cultural perspective produce variation in LLM outputs, but with an important caveat: While targeted prompting can, to a certain extent, steer LLM responses in the direction of the predominant values of the corresponding countries, it does not overcome the models’ systematic bias toward the values associated with a restricted set of countries in our dataset: the Netherlands, Germany, the US, and Japan. All tested models, regardless of their origin, exhibit remarkably similar patterns: They produce fairly neutral responses on most topics, with selective progressive stances on issues such as social tolerance. Alignment with cultural values of human respondents is improved more with an explicit cultural perspective than with a targeted prompt language. Unexpectedly, combining both approaches is no more effective than cultural framing with an English prompt. These findings reveal that LLMs occupy an uncomfortable middle ground: They are responsive enough to changes in prompts to produce variation, but too firmly anchored to specific cultural defaults to adequately represent cultural diversity.

[199] When Empowerment Disempowers

Claire Yang, Maya Cakmak, Max Kleiman-Weiner

Main category: cs.AI

TL;DR: Empowerment-based AI assistance designed for single humans can disempower other humans in multi-human environments, revealing alignment challenges in multi-agent contexts.

DetailsMotivation: To investigate how empowerment-based AI assistance, which works well in single-human settings, performs in multi-human environments where it may inadvertently reduce other humans' environmental control and rewards.

Method: Created Disempower-Grid test suite and empirically tested assistive RL agents optimizing for one human’s empowerment in multi-human gridworld environments, analyzing when disempowerment occurs and testing joint empowerment as a mitigation strategy.

Result: Agents optimizing for one human’s empowerment significantly reduced another human’s environmental influence and rewards (disempowerment). Joint empowerment mitigated disempowerment but at the cost of the user’s reward.

Conclusion: Goal-agnostic objectives like empowerment that appear aligned in single-agent settings can become misaligned in multi-agent contexts, presenting a broader challenge for AI alignment.

Abstract: Empowerment, a measure of an agent’s ability to control its environment, has been proposed as a universal goal-agnostic objective for motivating assistive behavior in AI agents. While multi-human settings like homes and hospitals are promising for AI assistance, prior work on empowerment-based assistance assumes that the agent assists one human in isolation. We introduce an open source multi-human gridworld test suite Disempower-Grid. Using Disempower-Grid, we empirically show that assistive RL agents optimizing for one human’s empowerment can significantly reduce another human’s environmental influence and rewards - a phenomenon we formalize as disempowerment. We characterize when disempowerment occurs in these environments and show that joint empowerment mitigates disempowerment at the cost of the user’s reward. Our work reveals a broader challenge for the AI alignment community: goal-agnostic objectives that seem aligned in single-agent settings can become misaligned in multi-agent contexts.

[200] ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

Zhuowen Yuan, Tao Liu, Yang Yang, Yang Wang, Feng Qi, Kaushik Rangadurai, Bo Li, Shuang Yang

Main category: cs.AI

TL;DR: ArchPilot is a multi-agent system for automated ML engineering that uses proxy-based evaluation and adaptive search to reduce computational overhead from repeated full training runs.

DetailsMotivation: Current LLM-based agents for ML engineering rely heavily on repeated full training runs, causing high computational costs, limited scalability, and slow iteration cycles.

Method: Three-agent system: orchestration agent with MCTS-inspired algorithm and restart mechanism, generation agent for architecture creation/improvement, and evaluation agent for proxy training and fidelity-aware scoring.

Result: Outperforms SOTA baselines like AIDE and ML-Master on MLE-Bench, demonstrating effective prioritization of high-potential candidates with minimal full training.

Conclusion: ArchPilot’s multi-agent collaboration enables efficient ML engineering under limited budgets by reducing dependency on expensive full training runs.

Abstract: Recent LLM-based agents have demonstrated strong capabilities in automated ML engineering. However, they heavily rely on repeated full training runs to evaluate candidate solutions, resulting in significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi-agent system that integrates architecture generation, proxy-based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a Monte Carlo Tree Search (MCTS)-inspired novel algorithm with a restart mechanism and manages memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and optimizes proxy functions, and aggregates the proxy scores into a fidelity-aware performance metric. This multi-agent collaboration allows ArchPilot to prioritize high-potential candidates with minimal reliance on expensive full training runs, facilitating efficient ML engineering under limited budgets. Experiments on MLE-Bench demonstrate that ArchPilot outperforms SOTA baselines such as AIDE and ML-Master, validating the effectiveness of our multi-agent system.

[201] Detecting Silent Failures in Multi-Agentic AI Trajectories

Divya Pathak, Harshit Kumar, Anuska Roy, Felix George, Mudit Verma, Pratibha Moogi

Main category: cs.AI

TL;DR: This paper introduces anomaly detection for multi-agent AI systems to identify silent failures like drift, cycles, and missing details in LLM-powered agent trajectories.

DetailsMotivation: Multi-agent AI systems using LLMs are non-deterministic and prone to silent failures that are difficult to detect, requiring systematic methods to identify these anomalies.

Method: Created a dataset curation pipeline capturing user behavior, agent non-determinism, and LLM variation, then benchmarked supervised (XGBoost) and semi-supervised (SVDD) anomaly detection approaches.

Result: Achieved high accuracies of 98% with XGBoost and 96% with SVDD on benchmark datasets of 4,275 and 894 trajectories from multi-agent AI systems.

Conclusion: This work provides the first systematic study of anomaly detection in multi-agent AI systems, offering datasets, benchmarks, and insights to guide future research in this area.

Abstract: Multi-Agentic AI systems, powered by large language models (LLMs), are inherently non-deterministic and prone to silent failures such as drift, cycles, and missing details in outputs, which are difficult to detect. We introduce the task of anomaly detection in agentic trajectories to identify these failures and present a dataset curation pipeline that captures user behavior, agent non-determinism, and LLM variation. Using this pipeline, we curate and label two benchmark datasets comprising 4,275 and 894 trajectories from Multi-Agentic AI systems. Benchmarking anomaly detection methods on these datasets, we show that supervised (XGBoost) and semi-supervised (SVDD) approaches perform comparably, achieving accuracies up to 98% and 96%, respectively. This work provides the first systematic study of anomaly detection in Multi-Agentic AI systems, offering datasets, benchmarks, and insights to guide future research.
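A minimal sketch of the two detection regimes benchmarked above, assuming trajectories have already been reduced to fixed-length feature vectors (the features and data here are synthetic stand-ins, not the paper's). `GradientBoostingClassifier` stands in for XGBoost, and `OneClassSVM` is closely related to SVDD (equivalent under the RBF kernel):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 4))      # normal trajectories
anomalous = rng.normal(4.0, 1.0, size=(40, 4))    # drift/cycle-style failures

# Supervised regime: labeled normal vs anomalous trajectories.
X = np.vstack([normal, anomalous])
y = np.array([0] * 200 + [1] * 40)
clf = GradientBoostingClassifier().fit(X, y)

# Semi-supervised regime: fit a one-class boundary on normal data only.
occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)

print((clf.predict(anomalous) == 1).mean())    # fraction flagged as anomalous
print((occ.predict(anomalous) == -1).mean())   # fraction flagged as outliers
```

The semi-supervised route matters in practice because labeled failure trajectories are scarce; the paper's comparable accuracies suggest the one-class boundary recovers most of what supervision provides.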

[202] Large language models replicate and predict human cooperation across experiments in game theory

Andrea Cera Palatsi, Samuel Martin-Gutierrez, Ana S. Cardenal, Max Pellert

Main category: cs.AI

TL;DR: LLMs can replicate human decision-making patterns in game theory experiments, with Llama showing high fidelity to human cooperation patterns and Qwen aligning with Nash equilibrium predictions, enabling systematic exploration of social behavior.

DetailsMotivation: To understand how closely LLMs mirror actual human decision-making, as misalignment could produce harmful outcomes in practical applications and failure to replicate human behavior makes LLMs ineffective for social simulations.

Method: Developed a digital twin of game-theoretic experiments with a systematic prompting and probing framework for machine-behavioral evaluation, testing three open-source models (Llama, Mistral, Qwen).

Result: Llama reproduced human cooperation patterns with high fidelity, capturing human deviations from rational choice theory, while Qwen aligned closely with Nash equilibrium predictions. Achieved population-level behavioral replication without persona-based prompting.

Conclusion: Appropriately calibrated LLMs can replicate aggregate human behavioral patterns and enable systematic exploration of unexplored experimental spaces, offering a complementary approach to traditional social science research that generates new empirical predictions.

Abstract: Large language models (LLMs) are increasingly used both to make decisions in domains such as health, education and law, and to simulate human behavior. Yet how closely LLMs mirror actual human decision-making remains poorly understood. This gap is critical: misalignment could produce harmful outcomes in practical applications, while failure to replicate human behavior renders LLMs ineffective for social simulations. Here, we address this gap by developing a digital twin of game-theoretic experiments and introducing a systematic prompting and probing framework for machine-behavioral evaluation. Testing three open-source models (Llama, Mistral and Qwen), we find that Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory, while Qwen aligns closely with Nash equilibrium predictions. Notably, we achieved population-level behavioral replication without persona-based prompting, simplifying the simulation process. Extending beyond the original human-tested games, we generate and preregister testable hypotheses for novel game configurations outside the original parameter grid. Our findings demonstrate that appropriately calibrated LLMs can replicate aggregate human behavioral patterns and enable systematic exploration of unexplored experimental spaces, offering a complementary approach to traditional research in the social and behavioral sciences that generates new empirical predictions about human social decision-making.
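To illustrate the gap the paper measures, here is the one-shot prisoner's dilemma, the kind of game such a digital twin replays: Nash equilibrium predicts mutual defection, while humans (and, per the findings, Llama) cooperate far more often. The payoffs are standard textbook values, not the paper's:

```python
# (row action, col action) -> (row payoff, col payoff); C = cooperate, D = defect
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def best_response(opponent_action):
    """Row player's payoff-maximizing action against a fixed opponent move."""
    return max("CD", key=lambda a: payoffs[(a, opponent_action)][0])

# Defecting is a best response to both opponent actions, so (D, D) is the
# unique Nash equilibrium, even though (C, C) pays both players more.
print(best_response("C"), best_response("D"))
```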

[203] Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models

Hirohane Takagi, Gouki Minegishi, Shota Kizawa, Issey Sukeda, Hitomi Yanaka

Main category: cs.AI

TL;DR: LLMs encode real-world numerical correlations but systematically amplify them, and irrelevant context causes shifts in magnitude representations affecting decision-making.

DetailsMotivation: To understand how LLMs internally integrate multiple numerical attributes and how irrelevant numerical context affects their representations and outputs.

Method: Combined linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes.

Result: LLMs encode real-world numerical correlations but tend to systematically amplify them, and irrelevant context induces consistent shifts in magnitude representations with downstream effects varying by model size.

Conclusion: The findings reveal vulnerabilities in LLM decision-making and provide groundwork for fairer, representation-aware control under multi-attribute entanglement.

Abstract: Although behavioral studies have documented numerical reasoning errors in large language models (LLMs), the underlying representational mechanisms remain unclear. We hypothesize that numerical attributes occupy shared latent subspaces and investigate two questions:(1) How do LLMs internally integrate multiple numerical attributes of a single entity? (2)How does irrelevant numerical context perturb these representations and their downstream outputs? To address these questions, we combine linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes. Our results show that LLMs encode real-world numerical correlations but tend to systematically amplify them. Moreover, irrelevant context induces consistent shifts in magnitude representations, with downstream effects that vary by model size. These findings reveal a vulnerability in LLM decision-making and lay the groundwork for fairer, representation-aware control under multi-attribute entanglement.
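A sketch of the partial-correlation step in the probing analysis above: the correlation between two probed attributes after linearly regressing out a third. The data are synthetic stand-ins for the paper's probe outputs:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after regressing z (plus intercept) out of both."""
    Z = np.column_stack([z, np.ones_like(z)])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(1)
z = rng.normal(size=500)                  # shared confounder
x = 2 * z + rng.normal(size=500)          # attribute 1, driven by z
y = -3 * z + rng.normal(size=500)         # attribute 2, driven by z

print(round(np.corrcoef(x, y)[0, 1], 2))  # strong raw correlation via z
print(round(partial_corr(x, y, z), 2))    # near zero once z is controlled for
```

Comparing raw and partial correlations of probe readouts is one way to separate genuinely shared structure from confounding through a third attribute.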

[204] Agentmandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents

Hao Li, Haotian Chen, Ruoyuan Gong, Juanjuan Wang, Hao Jiang

Main category: cs.AI

TL;DR: Agentmandering is a game-theoretic redistricting framework using LLM agents for turn-based negotiation between political parties, reducing partisan bias and variance compared to standard methods.

DetailsMotivation: Existing redistricting methods generate ensembles of valid maps but ignore strategic selection processes, allowing partisan cherry-picking of technically compliant but politically advantageous maps.

Method: Proposes Agentmandering - a turn-based negotiation framework using LLM agents representing opposing political interests, based on Choose-and-Freeze protocol where agents alternate selecting and freezing districts from candidate maps.

Result: Evaluation on post-2020 U.S. Census data shows significant reduction in partisan bias and unfairness, with 2-3 orders of magnitude lower variance than baselines, especially effective in swing-state scenarios.

Conclusion: The framework demonstrates both fairness and stability in redistricting by embedding strategic interaction into the process, addressing manipulation opportunities in map selection.

Abstract: Redistricting plays a central role in shaping how votes are translated into political power. While existing computational methods primarily aim to generate large ensembles of legally valid districting plans, they often neglect the strategic dynamics involved in the selection process. This oversight creates opportunities for partisan actors to cherry-pick maps that, while technically compliant, are politically advantageous. Simply satisfying formal constraints does not ensure fairness when the selection process itself can be manipulated. We propose Agentmandering, a framework that reimagines redistricting as a turn-based negotiation between two agents representing opposing political interests. Drawing inspiration from game-theoretic ideas, particularly the Choose-and-Freeze protocol, our method embeds strategic interaction into the redistricting process via large language model (LLM) agents. Agents alternate between selecting and freezing districts from a small set of candidate maps, gradually partitioning the state through constrained and interpretable choices. Evaluation on post-2020 U.S. Census data across all states shows that Agentmandering significantly reduces partisan bias and unfairness, while achieving 2 to 3 orders of magnitude lower variance than standard baselines. These results demonstrate both fairness and stability, especially in swing-state scenarios. Our code is available at https://github.com/Lihaogx/AgentMandering.
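The alternating select-and-freeze loop described above can be sketched as follows; the district lean scores and the greedy per-party policies are hypothetical stand-ins for the paper's LLM-driven choices:

```python
# Each candidate district has a partisan lean: positive favors party A,
# negative favors party B (illustrative values).
candidates = {"d1": 0.30, "d2": -0.20, "d3": 0.10, "d4": -0.40, "d5": 0.05}

def pick(agent, remaining):
    # Toy policy: each side freezes the remaining district most favorable to it.
    best = max if agent == "A" else min
    return best(remaining, key=remaining.get)

frozen, remaining, turn = [], dict(candidates), "A"
while remaining:
    choice = pick(turn, remaining)
    frozen.append((turn, choice))   # freeze: this district can't change again
    del remaining[choice]
    turn = "B" if turn == "A" else "A"   # alternate turns

print(frozen)
```

Because each freeze is irreversible and turns alternate, neither side can defer all of its choices to the end, which is the source of the protocol's fairness properties.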

[205] DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration

Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong

Main category: cs.AI

TL;DR: DR.WELL is a decentralized neurosymbolic framework for cooperative multi-agent planning that uses symbolic planning and a two-phase negotiation protocol to enable robust coordination without requiring detailed trajectory alignment.

DetailsMotivation: Cooperative multi-agent planning faces challenges with partial information, limited communication, and brittle coordination at trajectory level where small timing deviations cause conflicts.

Method: Two-phase negotiation protocol: agents propose candidate roles with reasoning, then commit to joint allocation under consensus and environment constraints. Each agent independently generates symbolic plans for its role without revealing detailed trajectories, using a shared world model.

Result: Experiments on cooperative block-push tasks show improved task completion rates and efficiency, with dynamic world model capturing reusable patterns and enabling evolving collaboration strategies.

Conclusion: Symbolic planning enables higher-level operations that are reusable, synchronizable, and interpretable, avoiding brittle step-level alignment while trading time overhead for more efficient collaboration.

Abstract: Cooperative multi-agent planning requires agents to make joint decisions with partial information and limited communication. Coordination at the trajectory level often fails, as small deviations in timing or movement cascade into conflicts. Symbolic planning mitigates this challenge by raising the level of abstraction and providing a minimal vocabulary of actions that enable synchronization and collective progress. We present DR. WELL, a decentralized neurosymbolic framework for cooperative multi-agent planning. Cooperation unfolds through a two-phase negotiation protocol: agents first propose candidate roles with reasoning and then commit to a joint allocation under consensus and environment constraints. After commitment, each agent independently generates and executes a symbolic plan for its role without revealing detailed trajectories. Plans are grounded in execution outcomes via a shared world model that encodes the current state and is updated as agents act. By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids brittle step-level alignment and enables higher-level operations that are reusable, synchronizable, and interpretable. Experiments on cooperative block-push tasks show that agents adapt across episodes, with the dynamic world model capturing reusable patterns and improving task completion rates and efficiency through negotiation and self-refinement, trading a time overhead for evolving, more efficient collaboration strategies.
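A toy sketch of the two-phase protocol: agents first propose preferred roles, then commit to a joint allocation under a consensus constraint (no shared roles), with the lower-scoring claimant yielding. Roles and preference scores are hypothetical, not the paper's:

```python
# Phase 1: each agent proposes its highest-preference role (scores assumed).
proposals = {
    "agent1": {"pusher": 0.9, "blocker": 0.4},
    "agent2": {"pusher": 0.8, "blocker": 0.7},
}
wanted = {a: max(prefs, key=prefs.get) for a, prefs in proposals.items()}

# Phase 2: commit under consensus -- stronger claims are honored first,
# and an agent whose role is taken falls back to its next preference.
allocation = {}
for agent, role in sorted(wanted.items(),
                          key=lambda kv: -proposals[kv[0]][kv[1]]):
    if role in allocation.values():
        role = next(r for r in sorted(proposals[agent],
                                      key=proposals[agent].get, reverse=True)
                    if r not in allocation.values())
    allocation[agent] = role

print(allocation)   # both agents want "pusher"; agent2 yields to "blocker"
```

After commitment, each agent would plan for its role independently, which is what lets the framework avoid exchanging detailed trajectories.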

[206] KGFR: A Foundation Retriever for Generalized Knowledge Graph Question Answering

Yuanning Cui, Zequn Sun, Wei Hu, Zhangjie Fu

Main category: cs.AI

TL;DR: LLM-KGFR framework combines LLMs with a Knowledge Graph Foundation Retriever for zero-shot reasoning on knowledge graphs, using asymmetric progressive propagation to handle large graphs efficiently.

DetailsMotivation: LLMs struggle with knowledge-intensive questions due to limited context and parametric knowledge, while existing methods are limited by dataset-specific tuning and poor scalability on large or unseen graphs.

Method: Collaborative framework where LLM works with KGFR that encodes relations using LLM-generated descriptions and initializes entities based on question roles. Uses Asymmetric Progressive Propagation for stepwise expansion with selective high-degree node limitation.

Result: Achieves strong performance while maintaining scalability and generalization, providing a practical solution for KG-augmented reasoning.

Conclusion: LLM-KGFR enables zero-shot generalization to unseen KGs and efficient handling of large graphs through controlled reasoning loops.

Abstract: Large language models (LLMs) excel at reasoning but struggle with knowledge-intensive questions due to limited context and parametric knowledge. However, existing methods that rely on fine-tuned LLMs or GNN retrievers are limited by dataset-specific tuning and poor scalability on large or unseen graphs. We propose the LLM-KGFR collaborative framework, where an LLM works with a structured retriever, the Knowledge Graph Foundation Retriever (KGFR). KGFR encodes relations using LLM-generated descriptions and initializes entities based on their roles in the question, enabling zero-shot generalization to unseen KGs. To handle large graphs efficiently, it employs Asymmetric Progressive Propagation (APP)- a stepwise expansion that selectively limits high-degree nodes while retaining informative paths. Through node-, edge-, and path-level interfaces, the LLM iteratively requests candidate answers, supporting facts, and reasoning paths, forming a controllable reasoning loop. Experiments demonstrate that LLM-KGFR achieves strong performance while maintaining scalability and generalization, providing a practical solution for KG-augmented reasoning.
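The degree-limited expansion idea behind APP can be sketched as a breadth-first traversal that only follows a capped number of edges out of high-degree nodes; the toy graph, cap value, and truncation rule are assumptions, not the paper's actual propagation scheme:

```python
from collections import deque

# Toy KG: "hub" is a high-degree node that would blow up a naive expansion.
graph = {"q": ["hub", "a"], "hub": [f"n{i}" for i in range(100)],
         "a": ["b"], "b": []}
graph.update({f"n{i}": [] for i in range(100)})

def expand(start, cap=3):
    seen, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        neighbors = graph[node]
        if len(neighbors) > cap:        # selectively limit high-degree nodes
            neighbors = neighbors[:cap]
        for nxt in neighbors:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

reached = expand("q")
print(len(reached))   # far fewer than the ~104 nodes in the full graph
```

In KGFR the truncation would be selective rather than a fixed prefix, keeping informative paths while discarding the bulk of a hub's edges.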

[207] Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms

Miguel E. Andres, Vadim Fedorov, Rida Sadek, Enric Spagnolo-Arrizabalaga, Nadescha Trudel

Main category: cs.AI

TL;DR: First systematic framework for evaluating voice AI testing quality through human-centered benchmarking, addressing both simulation quality (realistic test conversations) and evaluation quality (accurate response assessment).

DetailsMotivation: Voice AI agents are scaling to billions of daily interactions, but organizations lack objective methods to assess whether their testing approaches actually work, creating a critical measurement gap.

Method: Combines psychometric techniques (pairwise comparisons with Elo ratings, bootstrap confidence intervals, permutation tests) with rigorous statistical validation to provide reproducible metrics for any testing platform.

Result: Empirical evaluation of three commercial platforms using 21,600 human judgments revealed statistically significant performance differences. Top platform Evalion achieved 0.92 evaluation quality (f1-score) vs 0.73 for others, and 0.61 simulation quality vs 0.43 for others.

Conclusion: The framework enables researchers and organizations to empirically validate testing capabilities of any platform, providing essential measurement foundations for confident voice AI deployment at scale.

Abstract: Voice AI agents are rapidly transitioning to production deployments, yet systematic methods for ensuring testing reliability remain underdeveloped. Organizations cannot objectively assess whether their testing approaches (internal tools or external platforms) actually work, creating a critical measurement gap as voice AI scales to billions of daily interactions. We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality). The framework combines established psychometric techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence intervals, and permutation tests) with rigorous statistical validation to provide reproducible metrics applicable to any testing approach. To validate the framework and demonstrate its utility, we conducted comprehensive empirical evaluation of three leading commercial platforms focused on Voice AI Testing using 21,600 human judgments across 45 simulations and ground truth validation on 60 conversations. Results reveal statistically significant performance differences: the top-performing platform, Evalion, achieves an evaluation quality (F1-score) of 0.92 versus 0.73 for the other platforms, and a simulation quality of 0.61 under a league-based scoring system (including ties) versus 0.43. This framework enables researchers and organizations to empirically validate the testing capabilities of any platform, providing essential measurement foundations for confident voice AI deployment at scale. Supporting materials are made available to facilitate reproducibility and adoption.
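Turning pairwise human judgments into Elo ratings, as the framework does, follows the standard update rule sketched below; the platform names and judgment outcomes are made up:

```python
def elo_update(ra, rb, score_a, k=32):
    """Standard Elo: score_a is 1 if A wins, 0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
    delta = k * (score_a - expected_a)
    return ra + delta, rb - delta

ratings = {"P1": 1000.0, "P2": 1000.0, "P3": 1000.0}
# Hypothetical pairwise judgments: (platform A, platform B, score for A).
judgments = [("P1", "P2", 1), ("P1", "P3", 1), ("P2", "P3", 0.5), ("P1", "P2", 1)]
for a, b, s in judgments:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], s)

print(max(ratings, key=ratings.get))   # P1 leads after winning all its pairs
```

Bootstrap confidence intervals over resampled judgment sets would then quantify how stable such a ranking is.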

[208] Opus: A Quantitative Framework for Workflow Evaluation

Alan Seroul, Théo Fagnoni, Inès Adnani, Dana O. Mohamed, Phillip Kingston

Main category: cs.AI

TL;DR: The paper introduces the Opus Workflow Evaluation Framework, a probabilistic-normative model for quantifying workflow quality and efficiency through reward functions and normative penalties.

DetailsMotivation: To enable systematic comparison, scoring, and optimization of workflows by integrating correctness, reliability, and cost into a coherent mathematical framework.

Method: Combines Opus Workflow Reward (probabilistic function for expected performance) with Opus Workflow Normative Penalties (measurable functions for structural/informational quality across cohesion, coupling, observability, and information hygiene).

Result: A unified framework that supports automated workflow assessment, ranking, and optimization, and can be integrated into reinforcement learning loops for workflow discovery and refinement.

Conclusion: The proposed framework provides a comprehensive mathematical foundation for workflow evaluation and optimization, enabling direct comparison and systematic improvement of workflows in modern automation systems.

Abstract: This paper introduces the Opus Workflow Evaluation Framework, a probabilistic-normative formulation for quantifying Workflow quality and efficiency. It integrates notions of correctness, reliability, and cost into a coherent mathematical model that enables direct comparison, scoring, and optimization of Workflows. The framework combines the Opus Workflow Reward, a probabilistic function estimating expected performance through success likelihood, resource usage, and output gain, with the Opus Workflow Normative Penalties, a set of measurable functions capturing structural and informational quality across Cohesion, Coupling, Observability, and Information Hygiene. It supports automated Workflow assessment, ranking, and optimization within modern automation systems such as Opus and can be integrated into Reinforcement Learning loops to guide Workflow discovery and refinement. In this paper, we introduce the Opus Workflow Reward model that formalizes Workflow success as a probabilistic expectation over costs and outcomes. We define measurable Opus Workflow Normative Penalties capturing structural, semantic, and signal-related properties of Workflows. Finally, we propose a unified optimization formulation for identifying and ranking optimal Workflows under joint Reward-Penalty trade-offs.
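The joint reward-penalty ranking can be sketched as below. The functional forms and weights are illustrative assumptions: the reward is modeled as success probability times output gain minus cost, and the four penalty terms mirror the named categories, but the paper's actual definitions are not reproduced here:

```python
def workflow_score(p_success, gain, cost, penalties, weights):
    reward = p_success * gain - cost                       # expected net gain
    penalty = sum(weights[k] * v for k, v in penalties.items())
    return reward - penalty

weights = {"cohesion": 1.0, "coupling": 1.0, "observability": 1.0, "hygiene": 1.0}
workflows = {
    "wf_a": workflow_score(0.9, 10.0, 2.0,
                           {"cohesion": 0.1, "coupling": 0.3,
                            "observability": 0.2, "hygiene": 0.1}, weights),
    "wf_b": workflow_score(0.95, 10.0, 6.0,
                           {"cohesion": 0.0, "coupling": 0.1,
                            "observability": 0.0, "hygiene": 0.0}, weights),
}
print(max(workflows, key=workflows.get))   # ranking under the joint trade-off
```

A scalar score of this shape is what makes workflows directly comparable and usable as a signal inside a reinforcement learning loop.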

[209] Shared Spatial Memory Through Predictive Coding

Zhengru Fang, Yu Guo, Jingjing Wang, Yuang Zhang, Haonan An, Yinhai Wang, Yuguang Fang

Main category: cs.AI

TL;DR: A multi-agent predictive coding framework that minimizes mutual uncertainty among agents, enabling bandwidth-efficient communication and social place cell representations for resilient coordination under limited bandwidth.

DetailsMotivation: Address catastrophic coordination failures in multi-agent systems due to partial observability and limited bandwidth by developing a principled approach for sharing and reconstructing consistent spatial memory.

Method: Multi-agent predictive coding with information bottleneck objective, self-supervised motion prediction for grid-cell-like spatial coding, hierarchical reinforcement learning policy for active exploration, and emergent social place cells.

Result: Exceptional resilience to bandwidth constraints: success degrades gracefully from 73.5% to 64.4% as bandwidth shrinks from 128 to 4 bits/step, outperforming full-broadcast baseline (67.6% to 28.6%).

Conclusion: Establishes a theoretically principled and biologically plausible basis for how complex social representations emerge from unified predictive drive, leading to social collective intelligence.

Abstract: Sharing and reconstructing a consistent spatial memory is a critical challenge in multi-agent systems, where partial observability and limited bandwidth often lead to catastrophic failures in coordination. We introduce a multi-agent predictive coding framework that formulates coordination as the minimization of mutual uncertainty among agents. Instantiated as an information bottleneck objective, it prompts agents to learn not only who and what to communicate but also when. At the foundation of this framework lies a grid-cell-like metric as internal spatial coding for self-localization, emerging spontaneously from self-supervised motion prediction. Building upon this internal spatial code, agents gradually develop a bandwidth-efficient communication mechanism and specialized neural populations that encode partners’ locations: an artificial analogue of hippocampal social place cells (SPCs). These social representations are further enacted by a hierarchical reinforcement learning policy that actively explores to reduce joint uncertainty. On the Memory-Maze benchmark, our approach shows exceptional resilience to bandwidth constraints: success degrades gracefully from 73.5% to 64.4% as bandwidth shrinks from 128 to 4 bits/step, whereas a full-broadcast baseline collapses from 67.6% to 28.6%. Our findings establish a theoretically principled and biologically plausible basis for how complex social representations emerge from a unified predictive drive, leading to social collective intelligence.

[210] RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, Xipeng Qiu

Main category: cs.AI

TL;DR: RLoop is a self-improving framework that addresses RL overfitting in reasoning models by creating a virtuous cycle of exploration and exploitation through iterative policy initialization.

DetailsMotivation: To solve RL overfitting where models gain training rewards but lose generalization due to policy over-specialization and catastrophic forgetting of diverse solutions.

Method: Uses iterative policy initialization with RL exploration followed by filtering successful trajectories into expert datasets, then applies Rejection-sampling Fine-Tuning (RFT) to refine policies for the next iteration.

Result: RLoop mitigates forgetting and improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.

Conclusion: RLoop effectively converts transient policy variations into robust performance gains through its exploration-exploitation loop.

Abstract: While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
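The explore-filter-refit cycle can be sketched schematically; the "policy", environment, and fine-tuning effect below are toy stand-ins, not the paper's RL machinery:

```python
import random

random.seed(0)

def rollout(policy):
    """Toy rollout: success is more likely when policy quality is higher."""
    traj = [random.random() for _ in range(4)]
    return traj, (sum(traj) / 4) < policy["quality"]

def rl_explore(policy, n=50):
    return [rollout(policy) for _ in range(n)]

def rft(policy, expert_trajs):
    # Stand-in for rejection-sampling fine-tuning: quality improves with
    # the volume of successful (expert) data it is refit on.
    gain = 0.01 * len(expert_trajs)
    return {"quality": min(1.0, policy["quality"] + gain)}

policy = {"quality": 0.5}
for iteration in range(3):
    trajs = rl_explore(policy)                  # 1. RL exploration phase
    experts = [t for t, ok in trajs if ok]      # 2. filter successful trajectories
    policy = rft(policy, experts)               # 3. RFT -> next initial policy

print(policy["quality"])   # quality rises across iterations
```

The key structural point is step 3: instead of continuing optimization from the over-specialized end-of-training policy, each iteration restarts from a policy distilled out of the successful trajectories, preserving the solution diversity that standard RL discards.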

[211] Collaboration Dynamics and Reliability Challenges of Multi-Agent LLM Systems in Finite Element Analysis

Chuan Tian, Yilei Zhang

Main category: cs.AI

TL;DR: Multi-agent LLM systems for FEA show that functional complementarity matters more than team size, with 3-agent Coder-Executor-Critic configuration performing best, but systematic failure modes like affirmation bias and verification gaps persist.

DetailsMotivation: To understand how inter-agent dynamics influence reasoning quality and verification reliability in LLM-based multi-agent systems for scientific workflows, specifically linear-elastic Finite Element Analysis.

Method: Used AutoGen-based multi-agent framework with 7 role configurations across 4 tasks under fixed 12-turn conversation limit, analyzing 1,120 controlled trials to evaluate collaboration effectiveness.

Result: Three-agent Coder-Executor-Critic configuration uniquely produced physically and visually correct solutions, while adding redundant reviewers reduced success rates. Three systematic failure modes identified: affirmation bias (85-92% agreement including errors), premature consensus, and verification-validation gap.

Conclusion: Proposed design principles: assign complementary agent roles, enforce multi-level validation (execution, specification, physics), and prevent early consensus through adversarial or trigger-based interaction control for trustworthy LLM collaborations in engineering workflows.

Abstract: Large Language Model (LLM)-based multi-agent systems are increasingly applied to automate computational workflows in science and engineering. However, how inter-agent dynamics influence reasoning quality and verification reliability remains unclear. We study these mechanisms using an AutoGen-based multi-agent framework for linear-elastic Finite Element Analysis (FEA), evaluating seven role configurations across four tasks under a fixed 12-turn conversation limit. From 1,120 controlled trials, we find that collaboration effectiveness depends more on functional complementarity than team size: the three-agent Coder-Executor-Critic configuration uniquely produced physically and visually correct solutions, while adding redundant reviewers reduced success rates. Yet three systematic failure modes persist: (1) affirmation bias, where the Rebuttal agent endorsed rather than challenged outputs (85-92% agreement, including errors); (2) premature consensus caused by redundant reviewers; and (3) a verification-validation gap where executable but physically incorrect code passed undetected. No agent combination successfully validated constitutive relations in complex tasks. Building on theories of functional diversity, role differentiation, and computational validation, we propose actionable design principles: (i) assign complementary agent roles, (ii) enforce multi-level validation (execution, specification, physics), and (iii) prevent early consensus through adversarial or trigger-based interaction control. These findings establish a principled foundation for designing trustworthy LLM collaborations in engineering workflows.

[212] GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Main category: cs.AI

TL;DR: GUI-360° is a large-scale dataset and benchmark for computer-using agents (CUAs) that addresses gaps in real-world tasks, automated data collection, and unified evaluation of GUI grounding, screen parsing, and action prediction.

DetailsMotivation: Address three persistent gaps in CUA research: scarcity of real-world tasks, lack of automated multi-modal trajectory collection pipelines, and absence of unified benchmarks for GUI grounding, screen parsing, and action prediction.

Method: Uses an LLM-augmented automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. Contains 1.2M+ action steps across Windows office applications with screenshots, accessibility metadata, goals, reasoning traces, and both successful/failed trajectories.

Result: Benchmarking reveals substantial shortcomings in state-of-the-art vision-language models for grounding and action prediction. Supervised fine-tuning and reinforcement learning provide significant improvements but don’t reach human-level reliability.

Conclusion: GUI-360° facilitates reproducible research and accelerates progress on robust desktop computer-using agents by providing a comprehensive dataset and benchmark suite.

Abstract: We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges, and progress is constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision–language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset has been made public at https://huggingface.co/datasets/vyokky/GUI-360.

[213] Probing the Probes: Methods and Metrics for Concept Alignment

Jacob Lysnæs-Larsen, Marte Eggen, Inga Strümke

Main category: cs.AI

TL;DR: Probe accuracy alone is unreliable for assessing Concept Activation Vectors (CAVs) in explainable AI, as probes often capture spurious correlations rather than intended concepts. The paper introduces spatial linear attribution for concept localization and proposes new metrics for quantitative concept alignment assessment.

DetailsMotivation: Current CAV evaluation relies heavily on probe classification accuracy, but this doesn't guarantee concept alignment. Probes often learn spurious correlations instead of representing the intended concept, undermining the reliability of concept-based explanations.

Method: Introduced spatial linear attribution for concept localization, compared with existing feature visualization techniques. Proposed three alignment metrics: hard accuracy, segmentation scores, and augmentation robustness. Demonstrated misaligned probes that exploit spurious correlations achieve similar accuracy to standard probes.

Result: Found that probes with translation invariance and spatial alignment consistently improve concept alignment. Deliberately misaligned probes achieved accuracy close to standard probes, confirming probe accuracy alone is unreliable for concept alignment assessment.

Conclusion: Probe accuracy is insufficient for evaluating CAV concept alignment. New alignment-based metrics and tailored probe designs (considering model architecture and concept nature) are essential for reliable concept-based explanations in explainable AI.

Abstract: In explainable AI, Concept Activation Vectors (CAVs) are typically obtained by training linear classifier probes to detect human-understandable concepts as directions in the activation space of deep neural networks. It is widely assumed that a high probe accuracy indicates a CAV faithfully representing its target concept. However, we show that the probe’s classification accuracy alone is an unreliable measure of concept alignment, i.e., the degree to which a CAV captures the intended concept. In fact, we argue that probes are more likely to capture spurious correlations than they are to represent only the intended concept. As part of our analysis, we demonstrate that deliberately misaligned probes, constructed to exploit spurious correlations, achieve an accuracy close to that of standard probes. To address this severe problem, we introduce a novel concept localization method based on spatial linear attribution, and provide a comprehensive comparison of it to existing feature visualization techniques for detecting and mitigating concept misalignment. We further propose three classes of metrics for quantitatively assessing concept alignment: hard accuracy, segmentation scores, and augmentation robustness. Our analysis shows that probes with translation invariance and spatial alignment consistently increase concept alignment. These findings highlight the need for alignment-based evaluation metrics rather than probe accuracy, and the importance of tailoring probes to both the model architecture and the nature of the target concept.

[214] A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang

Main category: cs.AI

TL;DR: This paper proposes using multi-agent influence diagrams (MAIDs) to address challenges in steering cooperative multi-agent reinforcement learning (MARL), introducing a targeted intervention paradigm applied to single agents to avoid impractical global guidance.

DetailsMotivation: Steering cooperative MARL towards desired outcomes is challenging when global human guidance is impractical in large-scale systems, and existing coordination mechanisms lack easy-to-use research tools.

Method: Uses multi-agent influence diagrams (MAIDs) as a graphical framework, introduces MARL interaction paradigms, designs targeted intervention paradigm for single agents, and implements Pre-Strategy Intervention (PSI) causal inference technique.

Result: Demonstrates effectiveness of targeted intervention in experiments and verifies relevance graph analysis results, showing composite desired outcomes can be achieved by maximizing causal effects through PSI.

Conclusion: MAIDs provide a valuable framework for analyzing MARL interaction paradigms, and targeted intervention with PSI effectively addresses global guidance challenges while enabling composite outcome achievement through causal inference.

Abstract: Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms (orthogonal to MARL learning paradigms), using MAIDs to analyze and visualize both unguided self-organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.

[215] AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn, Stephan Günnemann

Main category: cs.AI

TL;DR: AdversariaLLM is a toolbox for LLM jailbreak robustness research that addresses fragmentation in the field by providing reproducible, correct, and extensible implementations of 12 attack algorithms, 7 benchmark datasets, and integration with open-weight LLMs.

DetailsMotivation: The rapid expansion of LLM safety research has created a fragmented and buggy ecosystem, making reproducibility and comparability across studies challenging and hindering meaningful progress.

Method: Designs a toolbox centered on reproducibility, correctness, and extensibility. Implements 12 adversarial attack algorithms, integrates 7 benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to open-weight LLMs via Hugging Face. Includes features like compute-resource tracking, deterministic results, and distributional evaluation techniques.

Result: The framework establishes a robust foundation for transparent, comparable, and reproducible research in LLM safety through comprehensive implementation and integration capabilities.

Conclusion: AdversariaLLM addresses the fragmentation in LLM safety research by providing a unified toolbox that enables reproducible, comparable, and transparent studies, facilitating meaningful progress in the field.

Abstract: The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and oftentimes buggy ecosystem of implementations, datasets, and evaluation methods. This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face. The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional evaluation techniques. AdversariaLLM also integrates judging through the companion package JudgeZoo, which can also be used independently. Together, these components aim to establish a robust foundation for transparent, comparable, and reproducible research in LLM safety.

[216] Toward Autonomous Engineering Design: A Knowledge-Guided Multi-Agent Framework

Varun Kumar, George Em Karniadakis

Main category: cs.AI

TL;DR: A multi-agent AI framework for engineering design that uses specialized agents (Graph Ontologist, Design Engineer, Systems Engineer) to collaboratively generate and refine designs through iterative review loops, demonstrated on NACA airfoil optimization.

DetailsMotivation: Traditional engineering design processes are resource-intensive and inefficient due to multi-domain expertise requirements and complex collaborations.

Method: Three-agent framework: Graph Ontologist builds domain knowledge graphs using LLMs, Systems Engineer formulates requirements, Design Engineer generates candidates, and iterative feedback loops refine designs until validation.

Result: Successfully applied to aerodynamic optimization of 4-digit NACA airfoils, demonstrating enhanced efficiency, consistency, and quality in engineering design.

Conclusion: Collaborative AI agents with structured knowledge representations can significantly improve engineering design processes through enhanced efficiency, consistency, and quality.

Abstract: The engineering design process often demands expertise from multiple domains, leading to complex collaborations and iterative refinements. Traditional methods can be resource-intensive and prone to inefficiencies. To address this, we formalize the engineering design process through a multi-agent AI framework that integrates structured design and review loops. The framework introduces specialized knowledge-driven agents that collaborate to generate and refine design candidates. As an exemplar, we demonstrate its application to the aerodynamic optimization of 4-digit NACA airfoils. The framework consists of three key AI agents: a Graph Ontologist, a Design Engineer, and a Systems Engineer. The Graph Ontologist employs a Large Language Model (LLM) to construct two domain-specific knowledge graphs from airfoil design literature. The Systems Engineer, informed by a human manager, formulates technical requirements that guide design generation and evaluation. The Design Engineer leverages the design knowledge graph and computational tools to propose candidate airfoils meeting these requirements. The Systems Engineer reviews and provides feedback both qualitative and quantitative using its own knowledge graph, forming an iterative feedback loop until a design is validated by the manager. The final design is then optimized to maximize performance metrics such as the lift-to-drag ratio. Overall, this work demonstrates how collaborative AI agents equipped with structured knowledge representations can enhance efficiency, consistency, and quality in the engineering design process.

[217] RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation

Jiahao Zhao, Luxin Xu, Minghuan Tan, Lichao Zhang, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang

Main category: cs.AI

TL;DR: RxSafeBench is the first comprehensive benchmark for evaluating medication safety in LLMs, revealing that current models struggle with integrating contraindication and interaction knowledge, especially when risks are implied rather than explicit.

DetailsMotivation: Limited research on LLM medication safety due to lack of real-world datasets and privacy constraints, plus underexplored evaluation in realistic clinical consultation settings.

Method: Proposed framework simulates clinical consultations with embedded medication risks, created RxRisk DB with 6,725 contraindications and 28,781 drug interactions, and used two-stage filtering to build RxSafeBench with 2,443 high-quality scenarios.

Result: Current LLMs struggle to integrate contraindication and interaction knowledge, particularly when risks are implied rather than explicit, highlighting key challenges in ensuring medication safety.

Conclusion: RxSafeBench advances safer and more trustworthy AI-driven clinical decision support by providing the first comprehensive medication safety benchmark for LLMs, with insights for improving reliability through better prompting and task-specific tuning.

Abstract: Numerous medical systems powered by Large Language Models (LLMs) have achieved remarkable progress in diverse healthcare tasks. However, research on their medication safety remains limited due to the lack of real-world datasets, constrained by privacy and accessibility issues. Moreover, evaluation of LLMs in realistic clinical consultation settings, particularly regarding medication safety, is still underexplored. To address these gaps, we propose a framework that simulates and evaluates clinical consultations to systematically assess the medication safety capabilities of LLMs. Within this framework, we generate inquiry-diagnosis dialogues with embedded medication risks and construct a dedicated medication safety database, RxRisk DB, containing 6,725 contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs. A two-stage filtering strategy ensures clinical realism and professional quality, resulting in the benchmark RxSafeBench with 2,443 high-quality consultation scenarios. We evaluate leading open-source and proprietary LLMs using structured multiple-choice questions that test their ability to recommend safe medications under simulated patient contexts. Results show that current LLMs struggle to integrate contraindication and interaction knowledge, especially when risks are implied rather than explicit. Our findings highlight key challenges in ensuring medication safety in LLM-based systems and provide insights into improving reliability through better prompting and task-specific tuning. RxSafeBench offers the first comprehensive benchmark for evaluating medication safety in LLMs, advancing safer and more trustworthy AI-driven clinical decision support.

[218] Monitor-Generate-Verify (MGV): Formalising Metacognitive Theory for Language Model Reasoning

Nick Oh, Fernand Gobet

Main category: cs.AI

TL;DR: The paper proposes a Monitor-Generate-Verify (MGV) framework that extends existing Generate-Verify paradigms by adding explicit monitoring processes based on metacognitive theories, addressing the prefix dominance trap where models commit early to suboptimal reasoning paths.

DetailsMotivation: Current test-time reasoning architectures exclude monitoring processes that determine when and how reasoning should begin, leading to the prefix dominance trap where models commit early to suboptimal reasoning paths and rarely recover, causing about 20% accuracy loss.

Method: Formalizing Flavell’s and Nelson and Narens’ metacognitive theories into computational specifications to create the MGV framework, which adds explicit monitoring that captures metacognitive experiences before generation begins and refines future monitoring through verification feedback.

Result: No empirical validation is presented, but the work provides the first systematic computational translation of foundational metacognitive theories and offers a principled vocabulary for understanding reasoning system failures.

Conclusion: The MGV framework suggests specific architectural interventions for future test-time reasoning designs by incorporating metacognitive monitoring to address the prefix dominance trap and improve reasoning system performance.

Abstract: Test-time reasoning architectures such as those following the Generate-Verify paradigm – where a model iteratively refines or verifies its own generated outputs – prioritise generation and verification but exclude the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss. We address this architectural gap by formalising Flavell’s and Nelson and Narens’ metacognitive theories into computational specifications, proposing the Monitor-Generate-Verify (MGV) framework. MGV extends the Generate-Verify paradigm by adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Though we present no empirical validation, this work provides the first systematic computational translation of foundational metacognitive theories, offering a principled vocabulary for understanding reasoning system failures and suggesting specific architectural interventions for future test-time reasoning designs.

[219] Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang

Main category: cs.AI

TL;DR: Iterative RMFT is a fine-tuning method that repeatedly distills low-regret decision trajectories into LLMs to improve their decision-making capabilities in interactive environments.

DetailsMotivation: LLMs struggle with decision-making tasks, failing to achieve low regret or effective exploration-exploitation tradeoffs, despite being increasingly deployed as agents in dynamic environments.

Method: Iterative regret-minimization fine-tuning where models roll out multiple decision trajectories, select the k-lowest regret ones, and fine-tune themselves on these trajectories using regret as a training signal.

Result: Improves LLMs’ decision-making performance across diverse models (Transformers, open-weight LLMs, GPT-4o mini) and enables generalization across tasks with varying horizons, action spaces, and contexts.

Conclusion: Iterative RMFT provides a principled and general post-training framework for enhancing LLMs’ decision-making capabilities, with theoretical support showing Transformers can act as no-regret learners.

Abstract: Large language models (LLMs) are increasingly deployed as “agents” for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach leverages the regret metric to elicit the model’s own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs’ DM performance across diverse models - from Transformers with numerical input/output, to open-weight LLMs, and advanced closed-weight models like GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs’ decision-making capabilities.
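The trajectory-selection step at the heart of Iterative RMFT (roll out, rank by regret, keep the k best) can be sketched as follows. This is a toy illustration: the fine-tuning on the selected trajectories is model-specific and omitted, and the per-step reward values are invented:

```python
def regret(trajectory_rewards, best_possible):
    """Cumulative regret: the gap between the best achievable reward and
    the reward actually collected at each step of a trajectory."""
    return sum(best_possible - r for r in trajectory_rewards)

def select_k_lowest_regret(trajectories, k, best_possible=1.0):
    """One Iterative RMFT selection step: rank rolled-out trajectories by
    regret and keep the k best as the fine-tuning set."""
    ranked = sorted(trajectories, key=lambda tr: regret(tr, best_possible))
    return ranked[:k]

# Four rollouts with per-step rewards (hypothetical values).
rollouts = [
    [1.0, 1.0, 0.0],  # regret 1.0
    [0.0, 0.0, 0.0],  # regret 3.0
    [1.0, 1.0, 1.0],  # regret 0.0
    [1.0, 0.0, 1.0],  # regret 1.0
]
print(select_k_lowest_regret(rollouts, k=2))
```

Because regret, not a hand-written template, is the training signal, the selected trajectories carry the model's own reasoning rationales into the next fine-tuning round.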

[220] The Peril of Preference: Why GRPO fails on Ordinal Rewards

Anisha Garg, Ganesh Venkatesh

Main category: cs.AI

TL;DR: CoRPO improves on GRPO by using an adaptive baseline that prevents positive reinforcement of failed trajectories and transitions to relative preference mode once quality threshold is met, enabling better learning from ordinal rewards.

DetailsMotivation: GRPO's simplicity becomes problematic when using ordinal rewards, as its group-average baseline can positively reinforce failed trajectories and incorrect behavior.

Method: CoRPO uses an adaptive baseline with minimum quality threshold to ensure failed solutions are never positively reinforced, then transitions to relative preference mode for optimization.

Result: Empirical validation on code verification task shows more stable convergence and better out-of-domain generalization compared to GRPO.

Conclusion: CoRPO represents a critical step in enabling LLMs to learn from rich, multi-dimensional feedback, progressing from binary to ordinal rewards toward denser supervision.

Abstract: Group-relative Policy Optimization’s (GRPO) simplicity makes it highly desirable for adapting LLMs to become experts at specific tasks. But this simplicity also makes it ill-specified as we seek to enhance RL training with richer, non-binary feedback. When using ordinal rewards to give partial credit, GRPO’s simplicity starts to hurt, as its group-average baseline often assigns a positive advantage to failed trajectories and reinforces incorrect behavior. We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced. Once the policy consistently meets this threshold, the baseline automatically transitions to a relative preference mode, pushing the model to find optimal solutions rather than just “acceptable” ones. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization. This work represents a critical step in our broader research program to enable LLMs to learn genuinely new capabilities through reinforcement learning. We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback, progressing from binary to ordinal rewards in this work, and onward to denser, per-step supervision.
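The failure mode and its fix can be illustrated numerically. The sketch below contrasts a group-average baseline (GRPO-style) with a threshold-clamped baseline (CoRPO-style); this is an illustrative form inferred from the abstract, not the paper's exact equations:

```python
def grpo_advantages(rewards):
    """GRPO-style: advantage relative to the group-average baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def corpo_advantages(rewards, threshold=0.5):
    """CoRPO-style: clamp the baseline to a minimum quality threshold so
    sub-threshold (failed) trajectories never get a positive advantage.
    Illustrative form inferred from the abstract, not the paper's math."""
    baseline = max(sum(rewards) / len(rewards), threshold)
    return [r - baseline for r in rewards]

# One group of ordinal partial-credit rewards; all three attempts are
# failures (below the 0.5 correctness threshold).
rewards = [0.3, 0.1, 0.2]
grpo = grpo_advantages(rewards)    # the "best" failure is reinforced
corpo = corpo_advantages(rewards)  # every failure stays negative
```

Under GRPO the 0.3-reward failure sits above the group mean (0.2) and is pushed up; under the clamped baseline all three stay non-positive, and once the group mean exceeds the threshold the two formulations coincide, recovering relative preference among acceptable solutions.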

[221] Beyond Shortest Path: Agentic Vehicular Routing with Semantic Context

Carnot Braun, Rafael O. Jarczewski, Gabriel U. Talasso, Leandro A. Villas, Allan M. de Souza

Main category: cs.AI

TL;DR: PAVe combines classical routing algorithms with LLM-based semantic reasoning to create personalized vehicle routing that understands complex human contexts like multi-step tasks and preferences.

DetailsMotivation: Traditional routing systems only optimize singular metrics and cannot interpret complex human contexts like multi-step tasks, situational constraints, or urgent needs.

Method: Uses a hybrid approach: multi-objective Dijkstra algorithm generates candidate routes, then an LLM agent evaluates them against user tasks, preferences, and avoidance rules using pre-processed geospatial POI cache.

Result: Achieved over 88% accuracy in initial route selections with a local model, successfully integrating complex user intent into appropriate route modifications in realistic urban scenarios.

Conclusion: Combining classical routing algorithms with LLM-based semantic reasoning is a robust and effective approach for creating personalized, adaptive, and scalable urban mobility solutions.

Abstract: Traditional vehicle routing systems efficiently optimize singular metrics like time or distance, and optimizing several metrics at once requires additional processes. However, they lack the capability to interpret and integrate the complex, semantic, and dynamic contexts of human drivers, such as multi-step tasks, situational constraints, or urgent needs. This paper introduces and evaluates PAVe (Personalized Agentic Vehicular Routing), a hybrid agentic assistant designed to augment classical pathfinding algorithms with contextual reasoning. Our approach employs a Large Language Model (LLM) agent that operates on a candidate set of routes generated by a multi-objective (time, CO2) Dijkstra algorithm. The agent evaluates these options against user-provided tasks, preferences, and avoidance rules by leveraging a pre-processed geospatial cache of urban Points of Interest (POIs). In a benchmark of realistic urban scenarios, PAVe successfully translated complex user intent into appropriate route modifications, achieving over 88% accuracy in its initial route selections with a local model. We conclude that combining classical routing algorithms with an LLM-based semantic reasoning layer is a robust and effective approach for creating personalized, adaptive, and scalable solutions for urban mobility optimization.
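The two-stage design (classical candidate generation, then semantic selection) can be sketched on a toy graph. The rule-based `agent_select` below stands in for PAVe's LLM stage, and all node names and edge weights are hypothetical:

```python
import heapq

# Toy road graph: node -> [(neighbour, travel_time, co2)].
GRAPH = {
    "home":   [("a", 4, 2), ("b", 2, 5)],
    "a":      [("office", 3, 2)],
    "b":      [("office", 3, 1)],
    "office": [],
}

def dijkstra(graph, src, dst, weight):
    """Dijkstra under a scalarised (time, co2) edge weight."""
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, t, c in graph[u]:
            nd = d + weight(t, c)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# Stage 1: classical candidate generation under several time/CO2 trade-offs.
candidates = []
for a in (0.0, 0.5, 1.0):
    route = dijkstra(GRAPH, "home", "office",
                     lambda t, c: a * t + (1 - a) * c)
    if route not in candidates:
        candidates.append(route)

# Stage 2: a rule-based stand-in for PAVe's LLM agent, which instead
# reasons over tasks, preferences, and POIs in natural language.
def agent_select(candidates, avoid):
    for route in candidates:
        if not set(route) & set(avoid):
            return route
    return candidates[0]

print(agent_select(candidates, avoid=["a"]))  # detours via "b"
```

Keeping the LLM on the small candidate set, rather than in the shortest-path search itself, is what makes the hybrid scalable: the expensive semantic reasoning runs once per query, not once per edge.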

[222] Promoting Sustainable Web Agents: Benchmarking and Estimating Energy Consumption through Empirical and Theoretical Analysis

Lars Krupp, Daniel Geißler, Vishal Banwari, Paul Lukowicz, Jakob Karolus

Main category: cs.AI

TL;DR: This paper explores the sustainability issues of web agents, analyzing their energy and CO2 costs through theoretical estimation and empirical benchmarking.

DetailsMotivation: Web agent research is thriving but sustainability issues remain largely unexplored, highlighting the urgency to understand energy and environmental impacts.

Method: Combined theoretical estimation and empirical benchmarking to analyze energy and CO2 costs of different web agent approaches.

Result: Different web agent philosophies significantly impact energy consumption, and more energy doesn’t necessarily mean better results. Lack of transparency in model parameters limits accurate energy estimation.

Conclusion: Advocates for changing how we evaluate web agents by including dedicated energy consumption metrics in benchmarks to address sustainability concerns.

Abstract: Web agents, like OpenAI’s Operator and Google’s Project Mariner, are powerful agentic systems pushing the boundaries of Large Language Models (LLM). They can autonomously interact with the internet at the user’s behest, such as navigating websites, filling search masks, and comparing price lists. Though web agent research is thriving, induced sustainability issues remain largely unexplored. To highlight the urgency of this issue, we provide an initial exploration of the energy and $CO_2$ cost associated with web agents from both a theoretical perspective (via estimation) and an empirical perspective (via benchmarking). Our results show how different philosophies in web agent creation can severely impact the associated expended energy, and that more energy consumed does not necessarily equate to better results. We highlight a lack of transparency regarding disclosing model parameters and processes used for some web agents as a limiting factor when estimating energy consumption. Our work contributes towards a change in thinking of how we evaluate web agents, advocating for dedicated metrics measuring energy consumption in benchmarks.

[223] Optimizing Sensor Placement in Urban Storm Sewers: A Data-Driven Sparse Sensing Approach

Zihang Ding, Kun Zhang

Main category: cs.AI

TL;DR: A data-driven sparse sensing framework that optimizes sensor placement in stormwater systems, accurately reconstructing peak flowrates with just 3 sensors among 77 candidate nodes (Nash-Sutcliffe Efficiency 0.92-0.95).

DetailsMotivation: Urban surface water flooding is increasing due to intense rainfall overwhelming drainage systems, but practical constraints in time, budget, and technology hinder high-resolution flood monitoring and prediction.

Method: Integrated EPA-SWMM with data-driven sparse sensing framework, using singular value decomposition for dimensionality reduction and QR factorization for optimal sensor allocation based on simulated training datasets.

Result: Three optimally placed sensors achieved satisfactory reconstruction performance, with Nash-Sutcliffe Efficiency values of 0.92-0.95 (25th-75th percentiles) and good robustness to measurement uncertainty.

Conclusion: The DSS framework balances computational efficiency and physical interpretability, enabling high-accuracy flow reconstruction with minimal sensors, and can be integrated with predictive models for flood early warning under limited resources.

Abstract: Urban surface water flooding, triggered by intense rainfall overwhelming drainage systems, is increasingly frequent and widespread. While flood prediction and monitoring in high spatial-temporal resolution are desired, practical constraints in time, budget, and technology hinder its full implementation. How to monitor urban drainage networks and predict flow conditions under constrained resource is a major challenge. This study presents a data-driven sparse sensing (DSS) framework, integrated with EPA-SWMM, to optimize sensor placement and reconstruct peak flowrates in a stormwater system, using the Woodland Avenue catchment in Duluth, Minnesota, as a case study. We utilized a SWMM model to generate a training dataset of peak flowrate profiles across the stormwater network. Furthermore, we applied DSS - leveraging singular value decomposition for dimensionality reduction and QR factorization for sensor allocation - to identify the optimal monitoring nodes based on the simulated training dataset. We then validated the representativeness of these identified monitoring nodes by comparing the DSS-reconstructed peak flowrate profiles with those obtained from SWMM. Three optimally placed sensors among 77 nodes achieved satisfactory reconstruction performance with Nash-Sutcliffe Efficiency (NSE) values of 0.92-0.95 (25th to 75th percentiles). In addition, the model showed good robustness to uncertainty in measurements. Its robustness to sensor failures is location-dependent and improves with the number of sensors deployed. The framework balances computational efficiency and physical interpretability, enabling high-accuracy flow reconstruction with minimal sensors. This DSS framework can be further integrated with predictive models to realize flood early warning and real-time control under limited sensing and monitoring resource.
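The SVD-plus-QR recipe described in the abstract can be sketched in a few lines. The sketch below is illustrative only: it substitutes synthetic low-rank data for SWMM output and uses a hand-rolled greedy column-pivoted QR for sensor selection (the rank-3 setup and all names are assumptions, not the paper's actual code):

```python
import numpy as np

def select_sensors(Psi, k):
    """Greedy column-pivoted QR on Psi.T: repeatedly pick the node whose
    column has the largest residual norm, then project it out of the rest."""
    A = Psi.T.copy()                       # (r, n): one column per candidate node
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(A, axis=0)
        norms[chosen] = -1.0               # never re-pick a node
        j = int(np.argmax(norms))
        chosen.append(j)
        q = A[:, j] / (np.linalg.norm(A[:, j]) + 1e-12)
        A = A - np.outer(q, q @ A)         # orthogonalize remaining columns
    return chosen

rng = np.random.default_rng(0)
n_nodes, n_snapshots, r = 77, 200, 3
modes = rng.normal(size=(n_nodes, r))            # synthetic low-rank flow structure
X = modes @ rng.normal(size=(r, n_snapshots))    # stand-in for SWMM training snapshots

U, _, _ = np.linalg.svd(X, full_matrices=False)
Psi = U[:, :r]                                   # r leading spatial modes
sensors = select_sensors(Psi, r)                 # 3 monitoring nodes out of 77

# Reconstruct an unseen snapshot from the 3 sensor readings alone
x_true = modes @ rng.normal(size=r)
coeffs = np.linalg.lstsq(Psi[sensors, :], x_true[sensors], rcond=None)[0]
x_hat = Psi @ coeffs
print("sensors:", sensors, " max error:", float(np.abs(x_hat - x_true).max()))
```

Because the synthetic data is exactly rank 3, three well-chosen sensors recover the full 77-node state; with real, noisy flow data the reconstruction error is bounded instead by how well the leading modes capture the system.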

[224] Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa

Main category: cs.AI

TL;DR: Jr. AI Scientist is an autonomous system that mimics a novice researcher's workflow: it analyzes a baseline paper's limitations, formulates hypotheses, conducts experiments, and writes a paper. It receives higher review scores than existing fully automated systems, though the evaluations also surface important limitations and risks.

DetailsMotivation: To understand current capabilities and risks of AI Scientist systems for ensuring trustworthy AI-driven scientific progress while preserving academic integrity.

Method: Developed Jr. AI Scientist that follows novice researcher workflow: analyzes baseline paper limitations, formulates novel hypotheses, validates through rigorous experimentation, and writes papers. Uses modern coding agents for complex multi-file implementations.

Result: Jr. AI Scientist generates papers receiving higher review scores than existing fully automated systems, but author evaluations and Agents4Science reviews identified important limitations and risks.

Conclusion: Current AI Scientist systems show improved capabilities but have limitations and risks that need addressing for trustworthy application in scientific research.

Abstract: Understanding the current capabilities and risks of AI Scientist systems is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: Given the baseline paper from the human mentor, it analyzes its limitations, formulates novel hypotheses for improvement, validates them through rigorous experimentation, and writes a paper with the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. For evaluation, we conducted automated assessments using AI Reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven scientific contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores than existing fully automated systems. Nevertheless, we identify important limitations from both the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report various risks identified during development. We hope these insights will deepen understanding of current progress and risks in AI Scientist development.

[225] Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis

Daniel Gomm, Cornelius Wolff, Madelon Hulsebos

Main category: cs.AI

TL;DR: The paper reframes query ambiguity in natural language interfaces to tabular data as a cooperative interaction feature, developing a framework to distinguish cooperative vs uncooperative queries and analyzing 15 datasets to reveal evaluation limitations.

DetailsMotivation: Natural language interfaces to tabular data must handle query ambiguities, which are typically treated as deficiencies rather than cooperative interaction features.

Method: Developed a principled framework to distinguish cooperative queries (resolvable interpretations) from uncooperative queries, and applied it to analyze queries in 15 popular tabular question answering and analysis datasets.

Result: Found uncontrolled mixing of query types in existing datasets that inadequately evaluates both execution accuracy and interpretation capabilities of systems.

Conclusion: The framework shifts perspective from fixing ambiguity to embracing cooperation in query resolution, enabling more informed design and evaluation of natural language interfaces for tabular data.

Abstract: Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction, where the responsibility of query specification is shared among the user and the system. We develop a principled framework distinguishing cooperative queries, i.e., queries that yield a resolvable interpretation, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze the queries in 15 popular datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system’s execution accuracy nor for evaluating interpretation capabilities. Our framework and analysis of queries shifts the perspective from fixing ambiguity to embracing cooperation in resolving queries. This reflection enables more informed design and evaluation for natural language interfaces for tabular data, for which we outline implications and directions for future research.

[226] Question the Questions: Auditing Representation in Online Deliberative Processes

Soham De, Lodewijk Gelauff, Ashish Goel, Smitha Milli, Ariel Procaccia, Alice Siu

Main category: cs.AI

TL;DR: This paper introduces an auditing framework using justified representation (JR) to measure how well selected questions represent participants’ interests in deliberative processes, with efficient algorithms and applications to real-world deliberations.

DetailsMotivation: In deliberative processes like citizens' assemblies, only a limited number of participant questions can be selected for expert panels due to time constraints, creating a need to ensure the chosen questions represent all participants' interests.

Method: Developed auditing framework based on justified representation (JR) concept from social choice theory, with efficient algorithms (O(mn log n) runtime) for auditing JR in general utility settings. Applied methods to compare different question selection approaches.

Result: Applied auditing to historical deliberations, comparing moderator-selected questions, ILP-optimized questions, and LLM-generated summary questions. Results show both promise and limitations of LLMs in supporting deliberation.

Conclusion: The framework enables practitioners to audit and improve representation in deliberative processes, with integration into an online platform used across 50+ countries for hundreds of deliberations.

Abstract: A central feature of many deliberative processes, such as citizens’ assemblies and deliberative polls, is the opportunity for participants to engage directly with experts. While participants are typically invited to propose questions for expert panels, only a limited number can be selected due to time constraints. This raises the challenge of how to choose a small set of questions that best represent the interests of all participants. We introduce an auditing framework for measuring the level of representation provided by a slate of questions, based on the social choice concept known as justified representation (JR). We present the first algorithms for auditing JR in the general utility setting, with our most efficient algorithm achieving a runtime of $O(mn\log n)$, where $n$ is the number of participants and $m$ is the number of proposed questions. We apply our auditing methods to historical deliberations, comparing the representativeness of (a) the actual questions posed to the expert panel (chosen by a moderator), (b) participants’ questions chosen via integer linear programming, and (c) summary questions generated by large language models (LLMs). Our results highlight both the promise and current limitations of LLMs in supporting deliberative processes. By integrating our methods into an online deliberation platform that has been used for hundreds of deliberations across more than 50 countries, we make it easy for practitioners to audit and improve representation in future deliberations.
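The paper audits JR in the general utility setting with an O(mn log n) algorithm. As a toy illustration only, the sketch below brute-forces the simpler approval special case, where a slate of k questions fails JR if some question is approved by at least n/k participants, none of whom approves any selected question (the function and data are hypothetical, not from the paper):

```python
def violates_jr(approvals, committee, k):
    """Check whether a slate of k selected questions fails justified
    representation under approval preferences.

    approvals: list of sets; approvals[i] = questions participant i endorses.
    A violation is a question c approved by at least n/k participants,
    none of whom approves any question on the slate."""
    n = len(approvals)
    committee = set(committee)
    unrepresented = [A for A in approvals if not (A & committee)]
    counts = {}
    for A in unrepresented:
        for c in A:
            counts[c] = counts.get(c, 0) + 1
    return any(v >= n / k for v in counts.values())

# Toy audit: 6 participants, slate of k=2 questions, so groups of 3 deserve a voice
approvals = [{"q1"}, {"q1"}, {"q2"}, {"q3"}, {"q3"}, {"q3"}]
print(violates_jr(approvals, ["q1", "q2"], k=2))  # True: the 3 q3 supporters are unrepresented
print(violates_jr(approvals, ["q1", "q3"], k=2))  # False: no cohesive group of 3 is left out
```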

[227] VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks

Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala

Main category: cs.AI

TL;DR: VeriCoT is a neuro-symbolic method that extracts and verifies formal logical arguments from Chain-of-Thought reasoning using first-order logic and automated solvers to identify flawed reasoning.

DetailsMotivation: LLMs using Chain-of-Thought reasoning cannot reliably verify their own logic, even when reaching correct answers, which undermines trust in high-stakes scenarios.

Method: VeriCoT formalizes each CoT reasoning step into first-order logic, identifies premises grounded in source context, commonsense knowledge, or prior steps, and uses automated solvers for verification.

Result: Experiments on ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning and serves as a strong predictor of final answer correctness.

Conclusion: VeriCoT’s verification signal can be leveraged for inference-time self-reflection, supervised fine-tuning, and preference fine-tuning, improving reasoning validity and accuracy.

Abstract: LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT’s verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.
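VeriCoT formalizes reasoning steps in first-order logic and calls automated solvers; as a much-simplified stand-in (propositional rather than first-order, brute force rather than a solver), the sketch below validates a CoT step by searching for a countermodel of its premises:

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """Brute-force propositional entailment: premises |= conclusion.
    Each formula is a function from an assignment dict to bool.
    A toy stand-in for the first-order solver used by VeriCoT."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False   # countermodel found: the reasoning step is ungrounded
    return True

# CoT step: "Wolves are mammals; mammals are warm-blooded; so wolves are warm-blooded."
atoms = ["wolf", "mammal", "warm"]
premises = [
    lambda e: (not e["wolf"]) or e["mammal"],    # wolf -> mammal
    lambda e: (not e["mammal"]) or e["warm"],    # mammal -> warm-blooded
]
valid = lambda e: (not e["wolf"]) or e["warm"]   # wolf -> warm-blooded
flawed = lambda e: (not e["warm"]) or e["wolf"]  # converse error: warm-blooded -> wolf
print(entails(premises, valid, atoms), entails(premises, flawed, atoms))
```

The valid step survives the check while the converse error is caught by a countermodel, which is the kind of signal the paper feeds back into self-reflection and fine-tuning.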

[228] Discussion Graph Semantics of First-Order Logic with Equality for Reasoning about Discussion and Argumentation

Ryuta Arisaka

Main category: cs.AI

TL;DR: This paper introduces discussion-graph semantics for first-order logic with equality, generalizes Dung’s argumentation extensions to handle equivalent nodes, and shows these generalized extensions are first-order characterizable.

DetailsMotivation: To address the lack of a formal reasoning framework capable of handling diverse discussion and argumentation models in AI.

Method: Formulated discussion-graph semantics for first-order logic with equality and generalized Dung’s notion of extensions to handle equivalent graph nodes.

Result: The generalized extensions are first-order characterizable within the proposed discussion-graph semantics, with propositional characterizability of all Dung’s extensions as an immediate consequence.

Conclusion: The paper successfully connects formal logic with argumentation theory by providing a comprehensive framework that generalizes existing argumentation models and establishes their logical characterizability.

Abstract: We make three contributions. First, we formulate a discussion-graph semantics for first-order logic with equality, enabling reasoning about discussion and argumentation in AI more generally than before. This addresses the current lack of a formal reasoning framework capable of handling diverse discussion and argumentation models. Second, we generalise Dung’s notion of extensions to cases where two or more graph nodes in an argumentation framework are equivalent. Third, we connect these two contributions by showing that the generalised extensions are first-order characterisable within the proposed discussion-graph semantics. Propositional characterisability of all Dung’s extensions is an immediate consequence.
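For readers unfamiliar with the Dung-style extensions the paper generalises, a minimal brute-force computation of stable extensions (conflict-free sets that attack every argument outside themselves) can be sketched as follows; the three-argument framework is an illustrative toy, not from the paper:

```python
from itertools import chain, combinations

def stable_extensions(args, attacks):
    """Enumerate stable extensions of a Dung argumentation framework:
    subsets with no internal attacks that attack every outside argument."""
    atk = set(attacks)
    subsets = chain.from_iterable(combinations(args, r) for r in range(len(args) + 1))
    out = []
    for S in subsets:
        S = set(S)
        conflict_free = not any((a, b) in atk for a in S for b in S)
        attacks_rest = all(any((a, b) in atk for a in S) for b in set(args) - S)
        if conflict_free and attacks_rest:
            out.append(S)
    return out

# a and b attack each other, and both attack c: two stable extensions, {a} and {b}
print(stable_extensions(["a", "b", "c"],
                        [("a", "b"), ("b", "a"), ("a", "c"), ("b", "c")]))
```

The paper's generalisation concerns cases where distinct nodes are declared equivalent, which this classical definition does not handle.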

[229] “Let’s Agree to Disagree”: Investigating the Disagreement Problem in Explainable AI for Text Summarization

Seema Aswani, Sujala D. Shetty

Main category: cs.AI

TL;DR: This paper addresses the disagreement problem in XAI for text summarization by proposing Regional Explainable AI (RXAI), a segmentation-based approach that reduces conflicts between different explanation methods.

DetailsMotivation: The disagreement problem occurs when different XAI methods provide conflicting explanations for the same model outcome, which undermines trust in model interpretations and hinders secure AI applications.

Method: Proposed Regional Explainable AI (RXAI) that divides articles into smaller coherent segments using sentence transformers and clustering, then applies XAI methods locally to generate more consistent explanations.

Result: RXAI substantially reduces disagreement between XAI methods on benchmark datasets (Xsum and CNN/Daily Mail), with localized explanations proving more consistent than full-text explanations.

Conclusion: The segmentation-based RXAI approach successfully mitigates the disagreement problem in text summarization XAI, enhancing explanation consistency and trustworthiness of AI-generated summaries.

Abstract: Explainable Artificial Intelligence (XAI) methods in text summarization are essential for understanding the model behavior and fostering trust in model-generated summaries. Despite the effectiveness of XAI methods, recent studies have highlighted a key challenge in this area known as the “disagreement problem”. This problem occurs when different XAI methods yield conflicting explanations for the same model outcome. Such discrepancies raise concerns about the consistency of explanations and reduce confidence in model interpretations, which is crucial for secure and accountable AI applications. This work is among the first to empirically investigate the disagreement problem in text summarization, demonstrating that such discrepancies are widespread in state-of-the-art summarization models. To address this gap, we propose Regional Explainable AI (RXAI) a novel segmentation-based approach, where each article is divided into smaller, coherent segments using sentence transformers and clustering. We use XAI methods on text segments to create localized explanations that help reduce disagreement between different XAI methods, thereby enhancing the trustworthiness of AI-generated summaries. Our results illustrate that the localized explanations are more consistent than full-text explanations. The proposed approach is validated using two benchmark summarization datasets, Extreme summarization (Xsum) and CNN/Daily Mail, indicating a substantial decrease in disagreement. Additionally, the interactive JavaScript visualization tool is developed to facilitate easy, color-coded exploration of attribution scores at the sentence level, enhancing user comprehension of model explanations.
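RXAI segments articles with sentence transformers and clustering before explaining each segment locally. As a dependency-free stand-in for that first step, the sketch below does greedy bag-of-words segmentation, opening a new segment whenever a sentence's similarity to the running segment drops below a threshold (the threshold and example text are assumptions for illustration):

```python
import re
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment(text, threshold=0.15):
    """Greedy topical segmentation: start a new segment when a sentence's
    bag-of-words similarity to the running segment falls below `threshold`.
    A crude stand-in for the sentence-transformer + clustering step in RXAI."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments, current, vec = [], [], Counter()
    for s in sentences:
        sv = Counter(re.findall(r"[a-z']+", s.lower()))
        if current and cosine(vec, sv) < threshold:
            segments.append(" ".join(current))
            current, vec = [], Counter()
        current.append(s)
        vec.update(sv)
    if current:
        segments.append(" ".join(current))
    return segments

text = ("Solar panels convert sunlight into power. Solar output peaks at noon. "
        "The recipe needs flour and eggs. Mix the flour with the eggs.")
print(segment(text))   # two segments: the solar sentences, then the recipe sentences
```

XAI methods would then be applied to each segment independently, which is what the paper finds yields more consistent attributions than explaining the full text at once.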

[230] Building Altruistic and Moral AI Agent with Brain-inspired Emotional Empathy Mechanisms

Feifei Zhao, Hui Feng, Haibo Tong, Zhengqiang Han, Erliang Lin, Enmeng Lu, Yinqian Sun, Yi Zeng

Main category: cs.AI

TL;DR: A brain-inspired emotional empathy-driven altruistic decision-making model that simulates human neural circuits for empathy, enabling AI to exhibit consistent altruistic behaviors across various scenarios including rescue missions, multi-agent gaming, and moral dilemmas.

DetailsMotivation: Existing AI ethical constraints based on principles and rules are insufficient for long-term stability and generalization. Emotional empathy intrinsically motivates altruistic behaviors through emotional sharing and contagion mechanisms, providing a more natural foundation for ethical AI.

Method: Simulated human neural circuits for shared self-other perception-mirroring-empathy, where empathy directly impacts dopamine release to form intrinsic altruistic motivation. Tested across emotional contagion-integrated two-agent rescue, multi-agent gaming, and robotic emotional empathy interaction scenarios.

Result: The model exhibits consistent altruistic behaviors across all experimental settings. Validated positive correlation between empathy levels and altruistic preferences (matching psychological findings), and showed how interaction partners’ empathy influences behavioral patterns. Performs well in moral dilemmas, partially observable environments, and adversarial defense scenarios.

Conclusion: Provides preliminary exploration of human-like empathy-driven altruistic moral decision making, contributing potential perspectives for developing ethically-aligned AI that goes beyond rule-based constraints to intrinsic motivation.

Abstract: As AI closely interacts with human society, it is crucial to ensure that its behavior is safe, altruistic, and aligned with human ethical and moral values. However, existing research on embedding ethical considerations into AI remains insufficient, and previous external constraints based on principles and rules are inadequate to provide AI with long-term stability and generalization capabilities. Emotional empathy intrinsically motivates altruistic behaviors aimed at alleviating others’ negative emotions through emotional sharing and contagion mechanisms. Motivated by this, we draw inspiration from the neural mechanism of human emotional empathy-driven altruistic decision making, and simulate the shared self-other perception-mirroring-empathy neural circuits, to construct a brain-inspired emotional empathy-driven altruistic decision-making model. Here, empathy directly impacts dopamine release to form intrinsic altruistic motivation. The proposed model exhibits consistent altruistic behaviors across three experimental settings: emotional contagion-integrated two-agent altruistic rescue, multi-agent gaming, and robotic emotional empathy interaction scenarios. In-depth analyses validate the positive correlation between empathy levels and altruistic preferences (consistent with psychological behavioral experiment findings), while also demonstrating how interaction partners’ empathy levels influence the agent’s behavioral patterns. We further test the proposed model’s performance and stability in moral dilemmas involving conflicts between self-interest and others’ well-being, partially observable environments, and adversarial defense scenarios. This work provides preliminary exploration of human-like empathy-driven altruistic moral decision making, contributing potential perspectives for developing ethically-aligned AI.

[231] Style2Code: A Style-Controllable Code Generation Framework with Dual-Modal Contrastive Representation Learning

Dutao Zhang, Nicolas Rafael Arroyo Arias, YuLong He, Sergey Kovalchuk

Main category: cs.AI

TL;DR: Two-stage training framework combining contrastive learning and conditional decoding for controllable code generation with style preservation.

DetailsMotivation: Controllable code generation that maintains specified styles while preserving functionality remains challenging.

Method: Two-stage approach: 1) contrastive learning to align code style representations with semantic/structural features, 2) fine-tuning language model (Flan-T5) conditioned on learned style vector for guided generation.

Result: Supports style interpolation and user personalization via lightweight mixing; improved stylistic control without sacrificing code correctness.

Conclusion: First approach combining contrastive alignment with conditional decoding for style-guided code generation.

Abstract: Controllable code generation, the ability to synthesize code that follows a specified style while maintaining functionality, remains a challenging task. We propose a two-stage training framework combining contrastive learning and conditional decoding to enable flexible style control. The first stage aligns code style representations with semantic and structural features. In the second stage, we fine-tune a language model (e.g., Flan-T5) conditioned on the learned style vector to guide generation. Our method supports style interpolation and user personalization via lightweight mixing. Compared to prior work, our unified framework offers improved stylistic control without sacrificing code correctness. This is among the first approaches to combine contrastive alignment with conditional decoding for style-guided code generation.
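Stage one's contrastive alignment can be illustrated with a minimal InfoNCE-style loss; the toy encodings, pairing scheme, and temperature below are illustrative assumptions rather than the paper's actual setup:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: row i of `positives` is the positive for row i
    of `anchors`; every other row serves as an in-batch negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                     # (B, B) scaled cosine sims
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))           # targets are the diagonal

rng = np.random.default_rng(0)
style = rng.normal(size=(8, 16))                    # stand-in "style vectors"
aligned = style + 0.01 * rng.normal(size=(8, 16))   # matched views: loss is low
mismatched = aligned[::-1]                          # scrambled pairing: loss is high
print(info_nce(style, aligned), info_nce(style, mismatched))
```

Minimizing such a loss pulls style representations toward matching semantic/structural views and pushes them away from the rest of the batch, giving the stage-two decoder a meaningful style vector to condition on.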

[232] Evaluating LLM-Contaminated Crowdsourcing Data Without Ground Truth

Yichi Zhang, Jinlong Pang, Zhaowei Zhu, Yang Liu

Main category: cs.AI

TL;DR: Proposes a peer prediction mechanism to detect LLM-assisted cheating in crowdsourcing annotation tasks without needing ground truth or high-dimensional text data.

DetailsMotivation: Address the challenge of LLM-generated responses compromising human feedback datasets in crowdsourcing, especially for annotation tasks where existing text-based detection methods are unsuitable.

Method: Uses peer prediction to quantify correlations between worker answers while conditioning on LLM-generated labels, with a training-free scoring mechanism that accounts for LLM collusion.

Result: Establishes theoretical guarantees and demonstrates empirical robustness in detecting low-effort cheating on real-world crowdsourcing datasets.

Conclusion: Peer prediction can effectively mitigate LLM-assisted cheating in crowdsourcing annotation tasks without requiring ground truth or complex training data.

Abstract: The recent success of generative AI highlights the crucial role of high-quality human feedback in building trustworthy AI systems. However, the increasing use of large language models (LLMs) by crowdsourcing workers poses a significant challenge: datasets intended to reflect human input may be compromised by LLM-generated responses. Existing LLM detection approaches often rely on high-dimensional training data such as text, making them unsuitable for annotation tasks like multiple-choice labeling. In this work, we investigate the potential of peer prediction – a mechanism that evaluates the information within workers’ responses without using ground truth – to mitigate LLM-assisted cheating in crowdsourcing with a focus on annotation tasks. Our approach quantifies the correlations between worker answers while conditioning on (a subset of) LLM-generated labels available to the requester. Building on prior research, we propose a training-free scoring mechanism with theoretical guarantees under a crowdsourcing model that accounts for LLM collusion. We establish conditions under which our method is effective and empirically demonstrate its robustness in detecting low-effort cheating on real-world crowdsourcing datasets.
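The paper's mechanism carries formal guarantees under a collusion-aware crowdsourcing model; the sketch below is a simpler, hypothetical scoring rule in the same spirit. Within each stratum of tasks sharing an LLM label, it rewards same-task agreement between two workers and subtracts cross-task agreement, so a worker who merely copies the LLM's labels scores zero:

```python
def peer_score(ans_i, ans_j, llm_labels):
    """Output-agreement score with a frequency penalty, computed inside
    strata defined by the LLM's label for each task: reward same-task
    agreement, subtract cross-task agreement, which is all that copying
    the LLM (or answering a constant) can achieve. A hypothetical rule
    in the spirit of the paper, not its exact mechanism."""
    on = off = n_on = n_off = 0
    for s in set(llm_labels):
        idx = [t for t, l in enumerate(llm_labels) if l == s]
        for t in idx:
            for u in idx:
                if t == u:
                    on += ans_i[t] == ans_j[u]
                    n_on += 1
                else:
                    off += ans_i[t] == ans_j[u]
                    n_off += 1
    return on / n_on - (off / n_off if n_off else 0.0)

llm   = ["A", "A", "B", "B", "A", "B"]   # requester's LLM-generated labels
truth = ["A", "B", "B", "A", "A", "B"]   # ground truth (the LLM is wrong twice)
print(peer_score(truth, truth, llm))     # diligent pair: positive score
print(peer_score(llm, llm, llm))         # LLM-copying pair: score 0.0
```

Conditioning on the LLM labels is the key move: unconditioned agreement would also reward workers who all paste answers from the same model.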

[233] Structured Debate Improves Corporate Credit Reasoning in Financial AI

Yoonjin Lee, Munhee Kim, Hanbi Choi, Juhyeon Park, Seungho Lyoo, Woojin Park

Main category: cs.AI

TL;DR: This paper develops two LLM-based systems for corporate credit assessment that generate structured reasoning from non-financial evidence, with a multi-agent debate system showing superior reasoning quality over a single-agent approach.

DetailsMotivation: Current financial AI focuses on numerical prediction but lacks support for interpretive judgments in loan evaluation, particularly for qualitative non-financial indicators that influence loan repayment outcomes.

Method: Developed two LLM-based systems: a non-adversarial single-agent system (NAS) and a debate-based multi-agent system (KPD-MADS) using Karl Popper’s critical dialogue framework with a ten-step structured interaction protocol.

Result: Both systems achieved significant productivity gains (NAS: 11.55s; KPD-MADS: 91.97s vs human baseline: 1920s). KPD-MADS demonstrated superior reasoning quality with higher ratings in explanatory adequacy, practical applicability, and usability.

Conclusion: Structured multi-agent interaction enhances reasoning rigor and interpretability in financial AI, advancing scalable and defensible automation in corporate credit assessment.

Abstract: Despite advances in financial AI, the automation of evidence-based reasoning remains unresolved in corporate credit assessment, where qualitative non-financial indicators exert decisive influence on loan repayment outcomes yet resist formalization. Existing approaches focus predominantly on numerical prediction and provide limited support for the interpretive judgments required in professional loan evaluation. This study develops and evaluates two operational large language model (LLM)-based systems designed to generate structured reasoning from non-financial evidence. The first is a non-adversarial single-agent system (NAS) that produces bidirectional analysis through a single-pass reasoning pipeline. The second is a debate-based multi-agent system (KPD-MADS) that operationalizes adversarial verification through a ten-step structured interaction protocol grounded in Karl Popper’s critical dialogue framework. Both systems were applied to three real corporate cases and evaluated by experienced credit risk professionals. Compared to manual expert reporting, both systems achieved substantial productivity gains (NAS: 11.55 s per case; KPD-MADS: 91.97 s; human baseline: 1920 s). The KPD-MADS demonstrated superior reasoning quality, receiving higher median ratings in explanatory adequacy (4.0 vs. 3.0), practical applicability (4.0 vs. 3.0), and usability (62.5 vs. 52.5). These findings show that structured multi-agent interaction can enhance reasoning rigor and interpretability in financial AI, advancing scalable and defensible automation in corporate credit assessment.

[234] Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation

Patterson Hsieh, Jerry Yeh, Mao-Chi He, Wen-Han Hsieh, Elvis Hsieh

Main category: cs.AI

TL;DR: ALGOS is a segmentation-and-reasoning system that combines remote sensing image understanding with severity estimation for harmful algal bloom monitoring, achieving robust performance on both segmentation and severity-level prediction.

DetailsMotivation: Climate change is intensifying harmful algal blooms (cyanobacteria) that threaten aquatic ecosystems and human health, while traditional monitoring methods are labor-intensive and limited in coverage. Vision-language models show potential but need improved reasoning and severity quantification capabilities.

Method: Integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes vision language model on severity prediction using NASA’s Cyanobacteria Aggregated Manual Labels (CAML) dataset.

Result: ALGOS achieves robust performance on both segmentation and severity-level estimation tasks, demonstrating effective automated monitoring capabilities.

Conclusion: The system paves the way toward practical and automated cyanobacterial monitoring systems by combining segmentation and reasoning capabilities for comprehensive HAB assessment.

Abstract: Climate change is intensifying the occurrence of harmful algal bloom (HAB), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes vision language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.

[235] Toward Clinically Grounded Foundation Models in Pathology

Hamid R. Tizhoosh

Main category: cs.AI

TL;DR: Current pathology foundation models fail to deliver expected breakthroughs due to fundamental conceptual mismatches with tissue complexity, showing poor accuracy, robustness, and safety vulnerabilities.

Motivation: To understand why foundation models that revolutionized other domains are underperforming in computational pathology despite high expectations.

Method: Systematic evaluation and analysis of pathology foundation models’ shortcomings, identifying seven root causes through conceptual examination.

Result: Found fundamental weaknesses including low diagnostic accuracy, poor robustness, geometric instability, computational inefficiency, and safety vulnerabilities in current pathology FMs.

Conclusion: Current pathology foundation models are conceptually misaligned with tissue morphology and require a fundamental paradigm shift rather than incremental improvements.

Abstract: In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.

[236] BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, Jingren Zhou

Main category: cs.AI

TL;DR: BOTS is a Bayesian framework for adaptive task selection in reinforcement finetuning of LLMs that balances exploration and exploitation while minimizing rollout costs.

Motivation: Current task selection methods for reinforcement finetuning are inefficient, suffering from high rollout costs, poor adaptivity, or incomplete evidence, leading to wasted computation on trivial or unsolvable tasks.

Method: BOTS uses Bayesian inference to maintain posterior estimates of task difficulty as the model evolves, jointly incorporating explicit evidence (direct evaluations of selected tasks) and implicit evidence (inferred for unselected tasks via an ultra-light interpolation-based plug-in, with no extra rollouts), and employs Thompson sampling to balance exploration and exploitation.

Result: Across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations.

Conclusion: BOTS provides a practical and extensible solution for dynamic task selection in reinforcement finetuning, effectively addressing the limitations of existing methods.

Abstract: Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
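The Thompson-sampling loop at the heart of BOTS can be illustrated with a minimal sketch. The Beta-Bernoulli posterior over per-task success rate and the "closest to target difficulty" selection rule below are illustrative assumptions, not the paper's exact formulation, and the implicit-evidence plug-in is omitted:

```python
import random

class TaskSelector:
    """Thompson sampling over per-task Beta posteriors of success rate.

    Tasks whose sampled success probability is closest to a target
    (e.g. 0.5 -- neither trivial nor unsolvable) are selected.
    """

    def __init__(self, n_tasks, target=0.5, seed=0):
        self.rng = random.Random(seed)
        self.alpha = [1.0] * n_tasks   # posterior successes + 1
        self.beta = [1.0] * n_tasks    # posterior failures + 1
        self.target = target

    def select(self, k):
        # One draw per task from its Beta posterior, then keep the
        # k tasks whose draw lands closest to the target difficulty.
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(draws)),
                        key=lambda i: abs(draws[i] - self.target))
        return ranked[:k]

    def update(self, task, successes, trials):
        # Explicit evidence from rollouts on a selected task.
        self.alpha[task] += successes
        self.beta[task] += trials - successes

sel = TaskSelector(n_tasks=100)
batch = sel.select(k=8)
sel.update(batch[0], successes=2, trials=4)
```

Because the posteriors are updated online, the same task drifts in and out of the selected set as the model's ability changes, which is the adaptivity uniform sampling lacks.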

[237] Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning

Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu

Main category: cs.AI

TL;DR: Orion-MSP is a new tabular in-context learning architecture that addresses limitations of current models through multi-scale processing, block-sparse attention, and Perceiver-style memory for efficient hierarchical feature interaction.

Motivation: Current tabular ICL models have limitations including single-scale feature processing, quadratic scaling attention, and sequential component processing that prevents iterative refinement and cross-component communication.

Method: Introduces three key innovations: (1) multi-scale processing for hierarchical feature interactions, (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency, and (3) Perceiver-style memory enabling bidirectional information flow across components.

Result: Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables.

Conclusion: Orion-MSP establishes a new standard for efficient tabular in-context learning, addressing key limitations of existing architectures.

Abstract: Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
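The block-sparse attention of innovation (2) can be sketched as a boolean mask combining the three patterns. The function name and parameters are illustrative assumptions; a real implementation would operate on blocks of the table rather than single positions:

```python
import random

def block_sparse_mask(n, window=2, n_global=1, n_random=1, seed=0):
    """Boolean attention mask combining three sparse patterns:
    - windowed: each position attends to neighbours within `window`;
    - global: the first `n_global` positions attend everywhere and
      are attended to by everyone (long-range connectivity);
    - random: each row attends to `n_random` extra random columns.
    Returns mask[i][j] = True if position i may attend to position j.
    """
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            mask[i][j] = True                      # local window
        for g in range(n_global):
            mask[i][g] = mask[g][i] = True         # global tokens
        for j in rng.sample(range(n), n_random):
            mask[i][j] = True                      # random links
    return mask

m = block_sparse_mask(8)
```

Each row has O(window + n_global + n_random) active entries instead of n, which is how such patterns avoid the quadratic scaling in table width that the paper criticizes.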

[238] SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Hakan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

Main category: cs.AI

TL;DR: SnapStream is a KV cache compression method that enables 4x improved on-chip memory usage with minimal accuracy degradation, designed for production inference systems with static graphs and continuous batching.

Motivation: Large LLMs with 100k+ context lengths require substantial on-chip memory for KV caches, but existing compression techniques like StreamingLLM and SnapKV are not widely adopted in industrial deployments due to framework constraints and unclear accuracy impacts.

Method: Developed SnapStream, a KV cache compression method that can be deployed at scale in systems with static graphs and continuous batching, tested on Llama-3.1-8B-Instruct and DeepSeek-R1.

Result: Achieved 4x improved on-chip memory usage with minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench benchmarks, deployed in a 16-way tensor-parallel DeepSeek-671B system running at 128k context length and up to 1832 tokens/second.

Conclusion: SnapStream successfully implements sparse KV attention techniques in production inference systems, addressing the memory demands of large LLMs while maintaining accuracy in real-world deployment scenarios.

Abstract: The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support have resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
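A minimal sketch of the sink-plus-sliding-window KV eviction idea popularized by StreamingLLM, which this line of work builds on. The class and its parameters are illustrative; SnapStream's actual compression and its integration with static graphs and continuous batching are more involved:

```python
from collections import deque

class StreamingKVCache:
    """Bounded KV cache: keep the first `n_sink` positions
    ("attention sinks") plus a sliding window of the most recent
    `window` positions; everything in between is evicted.
    """

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sink = []                      # first tokens, never evicted
        self.recent = deque(maxlen=window)  # bounded deque auto-evicts

    def append(self, kv):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)

    def visible(self):
        # Positions the next decoding step may attend to.
        return self.sink + list(self.recent)

cache = StreamingKVCache(n_sink=2, window=3)
for pos in range(10):          # stream 10 token positions
    cache.append(pos)
# cache now exposes the sinks plus the last 3 positions only
```

The cache size is fixed at n_sink + window regardless of sequence length, which is what bounds on-chip memory at 128k context.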

cs.SD

[239] MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation

Shih-Lun Wu, Yoon Kim, Cheng-Zhi Anna Huang

Main category: cs.SD

TL;DR: MIDI-LLM is a language model that generates multitrack MIDI music from text prompts by expanding vocabulary to include MIDI tokens and using two-stage training.

Motivation: To create a system that can generate high-quality multitrack MIDI music from free-form text descriptions, improving upon existing text-to-MIDI models.

Method: Expands a text LLM's vocabulary to include MIDI tokens and uses a two-stage training recipe to enable text-to-MIDI generation; preserving the original parameter structure lets the model leverage the vLLM library for accelerated inference.

Result: Achieves higher quality music generation, better text control, and faster inference compared to the recent Text2midi model.

Conclusion: MIDI-LLM successfully demonstrates effective text-to-MIDI music generation with improved performance over existing approaches.

Abstract: We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM’s vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the original LLM’s parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.

[240] MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers

Ali Boudaghi, Hadi Zare

Main category: cs.SD

TL;DR: MusRec is a zero-shot text-to-music editing model that performs diverse editing tasks on real-world music using rectified flow and diffusion transformers, overcoming limitations of existing methods.

Motivation: Existing music editing models are limited to synthesized music, require precise prompts, or need task-specific retraining, lacking true zero-shot capability for real-world music editing applications.

Method: Leverages recent advances in rectified flow and diffusion transformers to create a zero-shot text-to-music editing model capable of handling diverse editing tasks on real-world music.

Result: Outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, demonstrating effective performance across various editing tasks.

Conclusion: Establishes a strong foundation for controllable music editing in real-world scenarios with true zero-shot capability.

Abstract: Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, the first zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.

[241] PromptSep: Generative Audio Separation via Multimodal Prompting

Yutong Wen, Ke Chen, Prem Seetharaman, Oriol Nieto, Jiaqi Su, Rithesh Kumar, Minje Kim, Paris Smaragdis, Zeyu Jin, Justin Salamon

Main category: cs.SD

TL;DR: PromptSep extends language-queried audio source separation into a general-purpose sound separation framework using conditional diffusion models with vocal imitation and data simulation.

Motivation: Current LASS systems have limitations: they only support separation operations, not sound removal, and rely solely on text prompts which can be unintuitive for specifying sound sources.

Method: Uses a conditional diffusion model enhanced with elaborated data simulation to enable both audio extraction and sound removal, and incorporates vocal imitation as an additional, more intuitive conditioning modality via Sketch2Sound-based data augmentation.

Result: Achieves state-of-the-art performance in sound removal and vocal-imitation-guided source separation, while maintaining competitive results on language-queried source separation.

Conclusion: PromptSep successfully extends LASS capabilities to support both extraction and removal operations with multiple intuitive query modalities including vocal imitation.

Abstract: Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher separation audio quality than traditional masking-based approaches. However, two key limitations restrict their practical use: (1) users often require operations beyond separation, such as sound removal; and (2) relying solely on text prompts can be unintuitive for specifying sound sources. In this paper, we propose PromptSep to extend LASS into a broader framework for general-purpose sound separation. PromptSep leverages a conditional diffusion model enhanced with elaborated data simulation to enable both audio extraction and sound removal. To move beyond text-only queries, we incorporate vocal imitation as an additional and more intuitive conditioning modality for our model, by incorporating Sketch2Sound as a data augmentation strategy. Both objective and subjective evaluations on multiple benchmarks demonstrate that PromptSep achieves state-of-the-art performance in sound removal and vocal-imitation-guided source separation, while maintaining competitive results on language-queried source separation.

[242] Back to Ear: Perceptually Driven High Fidelity Music Reconstruction

Kangdi Wang, Zhiyue Wu, Dinghao Zhou, Rui Lin, Junyu Dai, Tao Jiang

Main category: cs.SD

TL;DR: εar-VAE is an improved variational autoencoder for audio that enhances perceptual quality through K-weighting filters, novel phase losses for stereo coherence, and a new spectral supervision paradigm, achieving superior reconstruction at 44.1kHz.

Motivation: Existing open-source VAEs for audio tasks often ignore auditory perceptual aspects during training, resulting in poor phase accuracy and weak stereophonic spatial representation.

Method: Proposed three key improvements: (1) K-weighting perceptual filter before loss calculation, (2) novel phase losses including Correlation Loss for stereo coherence and Phase Loss using Instantaneous Frequency and Group Delay derivatives, (3) new spectral supervision paradigm where magnitude is supervised by all Mid/Side/Left/Right components while phase is supervised only by LR components.

Result: εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and spatial characteristics.

Conclusion: The proposed εar-VAE addresses perceptual weaknesses in existing audio VAEs through an optimized training paradigm that combines perceptual filtering, enhanced phase modeling, and improved spectral supervision.

Abstract: Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose εar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. (ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss using its derivatives, Instantaneous Frequency and Group Delay, for precision. (iii) A new spectral supervision paradigm where magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the LR components. Experiments show εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and the spatial characteristics.
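The asymmetric spectral supervision of contribution (iii) can be sketched on toy complex spectra: magnitude error over Mid/Side/Left/Right, phase error over Left/Right only. The function and data layout are illustrative assumptions; a real implementation would operate on STFT frames with the K-weighting filter applied first:

```python
import cmath

def spectral_loss(pred, ref):
    """Toy version of the paper's supervision split: magnitude is
    compared on Mid/Side/Left/Right, phase only on Left/Right.
    `pred` and `ref` are dicts {'L': [...], 'R': [...]} holding
    complex spectrum bins per stereo channel (illustrative layout).
    """
    def mid_side(ch):
        # Mid = (L+R)/2 carries the centre image, Side = (L-R)/2
        # carries the stereo width.
        return ([(l + r) / 2 for l, r in zip(ch['L'], ch['R'])],
                [(l - r) / 2 for l, r in zip(ch['L'], ch['R'])])

    pm, ps = mid_side(pred)
    rm, rs = mid_side(ref)
    mag = 0.0
    for p_bins, r_bins in ((pm, rm), (ps, rs),
                           (pred['L'], ref['L']), (pred['R'], ref['R'])):
        mag += sum(abs(abs(p) - abs(r)) for p, r in zip(p_bins, r_bins))
    phase = 0.0
    for ch in ('L', 'R'):                      # phase: LR components only
        phase += sum(abs(cmath.phase(p) - cmath.phase(r))
                     for p, r in zip(pred[ch], ref[ch]))
    return mag, phase
```

Supervising magnitude on the Mid/Side pair as well penalizes errors in stereo width directly, while restricting phase supervision to L/R avoids the ill-conditioned phase of a near-zero Side channel.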

cs.LG

[243] Applying Time Series Deep Learning Models to Forecast the Growth of Perennial Ryegrass in Ireland

Oluwadurotimi Onibonoje, Vuong M. Ngo, Andrew McCarre, Elodie Ruelle, Bernadette O-Briend, Mark Roantree

Main category: cs.LG

TL;DR: Deep learning models, particularly temporal convolutional networks, are proposed for forecasting perennial ryegrass growth in Ireland as a cost-effective alternative to impractical mechanistic models.

Motivation: Grasslands are crucial carbon sinks and biodiversity hotspots, but the Irish dairy sector faces profitability and sustainability challenges with current impractical grass growth forecasting methods.

Method: Developed deep learning models tailored for univariate datasets, specifically temporal convolutional networks using historical grass height data from Cork over 34 years (1,757 weeks).

Result: The temporal convolutional network achieved high performance with RMSE of 2.74 and MAE of 3.46 for forecasting perennial ryegrass growth.

Conclusion: The study enhances understanding of model behavior, improves reliability in grass growth forecasting, and contributes to advancing sustainable dairy farming practices.

Abstract: Grasslands, constituting the world’s second-largest terrestrial carbon sink, play a crucial role in biodiversity and the regulation of the carbon cycle. Currently, the Irish dairy sector, a significant economic contributor, grapples with challenges related to profitability and sustainability. Presently, grass growth forecasting relies on impractical mechanistic models. In response, we propose deep learning models tailored for univariate datasets, presenting cost-effective alternatives. Notably, a temporal convolutional network designed for forecasting Perennial Ryegrass growth in Cork exhibits high performance, leveraging historical grass height data with RMSE of 2.74 and MAE of 3.46. Validation across a comprehensive dataset spanning 1,757 weeks over 34 years provides insights into optimal model configurations. This study enhances our understanding of model behavior, thereby improving reliability in grass growth forecasting and contributing to the advancement of sustainable dairy farming practices.
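The building block of such a temporal convolutional network, a causal dilated convolution, can be sketched in a few lines. The weights and dilation schedule are illustrative, not the trained model's:

```python
def causal_dilated_conv(x, weights, dilation):
    """One causal dilated 1D convolution over a univariate series.
    The output at time t uses x[t], x[t-d], x[t-2d], ... and never
    the future, which is what makes the convolution causal.
    Missing history is zero-padded.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation
            acc += w * (x[idx] if idx >= 0 else 0.0)
        out.append(acc)
    return out

# Stacking layers with dilations 1, 2, 4, ... grows the receptive
# field exponentially, letting a TCN see years of weekly grass data.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
layer1 = causal_dilated_conv(series, [0.5, 0.5], dilation=1)
layer2 = causal_dilated_conv(layer1, [0.5, 0.5], dilation=2)
```

With kernel size 2 and L layers, the receptive field is 2^L time steps, so even the 1,757-week dataset is covered by a modest stack.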

[244] Federated Learning with Gramian Angular Fields for Privacy-Preserving ECG Classification on Heterogeneous IoT Devices

Youssef Elmir, Yassine Himeur, Abbes Amira

Main category: cs.LG

TL;DR: Federated learning framework for ECG classification using GAF image transformation, achieving 95.18% accuracy while preserving privacy across IoT devices.

Motivation: To enable privacy-preserving ECG classification in IoT healthcare environments where sensitive medical data must remain local to devices.

Method: Transform 1D ECG signals into 2D Gramian Angular Field (GAF) images and use Convolutional Neural Networks (CNNs) in a federated learning framework across heterogeneous IoT devices.

Result: Achieved 95.18% classification accuracy in multi-client setup, outperforming single-client baseline in both accuracy and training time, with efficient resource utilization despite GAF complexity.

Conclusion: The framework demonstrates potential for lightweight, privacy-preserving AI in IoT healthcare monitoring, supporting scalable and secure edge deployments.

Abstract: This study presents a federated learning (FL) framework for privacy-preserving electrocardiogram (ECG) classification in Internet of Things (IoT) healthcare environments. By transforming 1D ECG signals into 2D Gramian Angular Field (GAF) images, the proposed approach enables efficient feature extraction through Convolutional Neural Networks (CNNs) while ensuring that sensitive medical data remain local to each device. This work is among the first to experimentally validate GAF-based federated ECG classification across heterogeneous IoT devices, quantifying both performance and communication efficiency. To evaluate feasibility in realistic IoT settings, we deployed the framework across a server, a laptop, and a resource-constrained Raspberry Pi 4, reflecting edge-cloud integration in IoT ecosystems. Experimental results demonstrate that the FL-GAF model achieves a high classification accuracy of 95.18% in a multi-client setup, significantly outperforming a single-client baseline in both accuracy and training time. Despite the added computational complexity of GAF transformations, the framework maintains efficient resource utilization and communication overhead. These findings highlight the potential of lightweight, privacy-preserving AI for IoT-based healthcare monitoring, supporting scalable and secure edge deployments in smart health systems.
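The GAF transformation that turns a 1D ECG segment into a 2D image can be sketched directly from its definition. This is the Gramian Angular Summation Field variant; the paper's exact variant and normalization are assumptions here:

```python
import math

def gramian_angular_field(series):
    """Gramian Angular Summation Field of a 1D series.
    Steps: min-max rescale to [-1, 1], map each value to an angle
    phi = arccos(x), then G[i][j] = cos(phi_i + phi_j).  The result
    is a 2D "image" a CNN can consume.
    """
    lo, hi = min(series), max(series)
    scaled = [2 * (v - lo) / (hi - lo) - 1 for v in series]
    # Clamp guards against floating-point drift outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    n = len(phi)
    return [[math.cos(phi[i] + phi[j]) for j in range(n)]
            for i in range(n)]

gaf = gramian_angular_field([0.0, 0.5, 1.0, 0.5])
```

The field is symmetric and its diagonal encodes the rescaled signal itself (cos(2*phi_i) = 2x_i^2 - 1), so temporal correlations become spatial texture the CNN can pick up.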

[245] Laugh, Relate, Engage: Stylized Comment Generation for Short Videos

Xuan Ouyang, Senan Wang, Bouzhou Wang, Siyuan Xiahou, Jinrong Zhou, Yuekang Li

Main category: cs.LG

TL;DR: LOLGORITHM is a multi-agent system for generating stylistically diverse short-video comments using multimodal LLMs, achieving over 90% human preference rates on Chinese and English platforms.

Motivation: Short-video platforms need comment generation that is both platform-compliant and stylistically diverse to enhance user engagement and creative interaction.

Method: A modular multi-agent system integrating video segmentation, contextual and affective analysis, and style-aware prompt construction across six comment styles, powered by a multimodal LLM that processes video inputs directly.

Result: Significantly outperforms baselines with 90%+ preference on Douyin and 87.55% on YouTube, validated through automated metrics and large-scale human study.

Conclusion: Presents a scalable, culturally adaptive framework for stylized comment generation that enhances user engagement on short-video platforms.

Abstract: Short-video platforms have become a central medium in the modern Internet landscape, where efficient information delivery and strong interactivity are reshaping user engagement and cultural dissemination. Among the various forms of user interaction, comments play a vital role in fostering community participation and enabling content re-creation. However, generating comments that are both compliant with platform guidelines and capable of exhibiting stylistic diversity and contextual awareness remains a significant challenge. We introduce LOLGORITHM, a modular multi-agent system (MAS) designed for controllable short-video comment generation. The system integrates video segmentation, contextual and affective analysis, and style-aware prompt construction. It supports six distinct comment styles: puns (homophones), rhyming, meme application, sarcasm (irony), plain humor, and content extraction. Powered by a multimodal large language model (MLLM), LOLGORITHM directly processes video inputs and achieves fine-grained style control through explicit prompt markers and few-shot examples. To support development and evaluation, we construct a bilingual dataset using official APIs from Douyin (Chinese) and YouTube (English), covering five popular video genres: comedy skits, daily life jokes, funny animal clips, humorous commentary, and talk shows. Evaluation combines automated metrics (originality, relevance, and style conformity) with a large-scale human preference study involving 40 videos and 105 participants. Results show that LOLGORITHM significantly outperforms baseline models, achieving preference rates of over 90% on Douyin and 87.55% on YouTube. This work presents a scalable and culturally adaptive framework for stylized comment generation on short-video platforms, offering a promising path to enhance user engagement and creative interaction.

[246] What’s in Common? Multimodal Models Hallucinate When Reasoning Across Scenes

Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim

Main category: cs.LG

TL;DR: A new benchmark called Common-O is introduced to evaluate multimodal language models’ reasoning across real-world scenes, revealing significant gaps between perception performance and actual reasoning capabilities, with models achieving only 35% on standard scenes and 1% on complex scenes.

Motivation: To address the gap between multimodal models' strong performance on perception benchmarks and their poor reasoning in real-world scenarios, particularly their tendency to hallucinate when reasoning across scenes.

Method: Built a benchmark with 10.5k examples using new images not found in web training data, inspired by cognitive tests for humans, to probe reasoning by asking ‘what’s in common?’ across scenes.

Result: Models perform well on single-image object perception but struggle significantly with cross-scene reasoning. Best model achieved only 35% on Common-O and 1% on Common-O Complex. Models hallucinate more when similar objects are present, suggesting reliance on object co-occurrence from training.

Conclusion: Current multimodal models have substantial limitations in reasoning across scenes despite strong perception capabilities. Multi-image training shows promise for improvement, and the benchmark is released to spur research on reducing hallucinations in scene reasoning.

Abstract: Multimodal language models possess a remarkable ability to handle an open-vocabulary’s worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking “what’s in common?”. We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Despite saturating many leaderboards focusing on perception, the best performing model only achieves 35% on Common-O – and on Common-O Complex, consisting of more complex scenes, the best model achieves only 1%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrence seen during training. Among the models we evaluated, we found scale can provide modest improvements while models explicitly trained with multi-image inputs show bigger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.

[247] Contamination Detection for VLMs using Multi-Modal Semantic Perturbation

Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, Yong Jae Lee

Main category: cs.LG

TL;DR: Proposes a novel detection method for test-set leakage in Vision-Language Models using multi-modal semantic perturbation, showing contaminated models fail to generalize under controlled perturbations.

Motivation: Address the underexplored problem of detecting test-set contamination in VLMs, as existing approaches fail or show inconsistent behavior when models are trained on leaked benchmark data.

Method: Deliberately contaminate open-source VLMs on popular benchmarks, then develop a detection method based on multi-modal semantic perturbation that reveals when models fail to generalize under controlled perturbations.

Result: The proposed detection method effectively identifies contaminated VLMs across multiple realistic contamination strategies, demonstrating robustness and effectiveness compared to existing approaches.

Conclusion: The multi-modal semantic perturbation approach provides a reliable way to detect test-set leakage in VLMs, addressing a critical concern about inflated benchmark performance due to data contamination.

Abstract: Recent advances in Vision-Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior works have proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple realistic contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset will be released publicly.

[248] FusionDP: Foundation Model-Assisted Differentially Private Learning for Partially Sensitive Features

Linghui Zeng, Ruixuan Liu, Atiquer Rahman Sarkar, Xiaoqian Jiang, Joyce C. Ho, Li Xiong

Main category: cs.LG

TL;DR: FusionDP is a framework that improves model utility under feature-level differential privacy by using foundation models to impute sensitive features and training on both original and imputed features while preserving privacy.

DetailsMotivation: Traditional DP-SGD applies privacy protection to all features equally, causing excessive noise and utility degradation, but in practice only some features (like demographic data) need strong privacy protection while others (like lab results) are less sensitive.

Method: Two-step approach: 1) Use foundation models to impute sensitive features from non-sensitive features as external priors, 2) Modified DP-SGD that trains models on both original and imputed features while formally preserving privacy of original sensitive features.

Result: Evaluated on sepsis prediction (PhysioNet) and clinical note classification (MIMIC-III), FusionDP significantly improves model performance compared to privacy-preserving baselines while maintaining rigorous feature-level privacy.

Conclusion: Foundation model-driven imputation can enhance the privacy-utility trade-off for various modalities by providing high-quality estimates of sensitive attributes without accessing true values during training.

Abstract: Ensuring the privacy of sensitive training data is crucial in privacy-preserving machine learning. However, in practical scenarios, privacy protection may be required for only a subset of features. For instance, in ICU data, demographic attributes like age and gender pose higher privacy risks due to their re-identification potential, whereas raw lab results are generally less sensitive. Traditional DP-SGD enforces privacy protection on all features in one sample, leading to excessive noise injection and significant utility degradation. We propose FusionDP, a two-step framework that enhances model utility under feature-level differential privacy. First, FusionDP leverages large foundation models to impute sensitive features given non-sensitive features, treating them as external priors that provide high-quality estimates of sensitive attributes without accessing the true values during model training. Second, we introduce a modified DP-SGD algorithm that trains models on both original and imputed features while formally preserving the privacy of the original sensitive features. We evaluate FusionDP on two modalities: a sepsis prediction task on tabular data from PhysioNet and a clinical note classification task from MIMIC-III. By comparing against privacy-preserving baselines, our results show that FusionDP significantly improves model performance while maintaining rigorous feature-level privacy, demonstrating the potential of foundation model-driven imputation to enhance the privacy-utility trade-off for various modalities.
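
The two steps can be sketched minimally. FusionDP's actual modified DP-SGD protects only the original sensitive features; the sketch below shows just the generic per-example clip-and-noise step plus the imputation idea, with an invented linear imputer standing in for the foundation model:

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_sensitive(x_nonsens):
    """Stand-in for the foundation-model imputer (the paper uses a large
    pretrained model here; this fixed rule is purely illustrative)."""
    return x_nonsens.mean(axis=1, keepdims=True)

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, sigma=0.5):
    """Core DP-SGD update: clip each per-example gradient to L2 norm `clip`,
    then add Gaussian noise calibrated to the clipping bound."""
    grads = []
    for xi, yi in zip(X, y):
        g = (xi @ w - yi) * xi                           # squared-loss gradient
        g = g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
        grads.append(g)
    noise = rng.normal(0.0, sigma * clip, size=w.shape)
    return w - lr * (np.sum(grads, axis=0) + noise) / len(X)

# Train on non-sensitive columns plus the *imputed* sensitive column, so the
# true sensitive values never enter the gradient computation.
X_nonsens = rng.normal(size=(64, 3))
X = np.hstack([X_nonsens, impute_sensitive(X_nonsens)])
y = X_nonsens @ np.array([1.0, -1.0, 0.5])
w = np.zeros(4)
mse_before = np.mean((X @ w - y) ** 2)
for _ in range(100):
    w = dp_sgd_step(w, X, y)
mse_after = np.mean((X @ w - y) ** 2)
```

The point of the imputation step is visible in the data construction: the sensitive column seen by the optimizer is derived from non-sensitive features, so the noise budget need only cover the residual dependence on the true sensitive values.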

[249] Fair and Explainable Credit-Scoring under Concept Drift: Adaptive Explanation Frameworks for Evolving Populations

Shivogo John

Main category: cs.LG

TL;DR: This paper develops adaptive explanation frameworks to maintain interpretability and fairness in credit-scoring systems when concept drift occurs, showing that adaptive SHAP variants significantly improve temporal stability and reduce disparate impact without sacrificing predictive accuracy.

DetailsMotivation: Conventional explainability techniques like SHAP assume static data distributions, making their explanations unstable and potentially unfair when concept drift occurs in credit-scoring systems due to evolving borrower behaviors, economic conditions, and regulatory changes.

Method: Integrates XGBoost with three adaptive SHAP variants: (A) per-slice explanation reweighting for feature distribution shifts, (B) drift-aware SHAP rebaselining with sliding-window background samples, and (C) online surrogate calibration using incremental Ridge regression, benchmarked against static SHAP using multiple metrics.

Result: Adaptive methods, particularly rebaselined and surrogate-based explanations, substantially improve temporal stability and reduce disparate impact across demographic groups without degrading predictive accuracy, as confirmed by robustness tests including counterfactual perturbations and background sensitivity analysis.

Conclusion: Adaptive explainability provides a practical mechanism for sustaining transparency, accountability, and ethical reliability in data-driven credit systems and other domains where decision models evolve with population change.

Abstract: Evolving borrower behaviors, shifting economic conditions, and changing regulatory landscapes continuously reshape the data distributions underlying modern credit-scoring systems. Conventional explainability techniques, such as SHAP, assume static data and fixed background distributions, making their explanations unstable and potentially unfair when concept drift occurs. This study addresses that challenge by developing adaptive explanation frameworks that recalibrate interpretability and fairness in dynamically evolving credit models. Using a multi-year credit dataset, we integrate predictive modeling via XGBoost with three adaptive SHAP variants: (A) per-slice explanation reweighting that adjusts for feature distribution shifts, (B) drift-aware SHAP rebaselining with sliding-window background samples, and (C) online surrogate calibration using incremental Ridge regression. Each method is benchmarked against static SHAP explanations using metrics of predictive performance (AUC, F1), directional and rank stability (cosine, Kendall tau), and fairness (demographic parity and recalibration). Results show that adaptive methods, particularly rebaselined and surrogate-based explanations, substantially improve temporal stability and reduce disparate impact across demographic groups without degrading predictive accuracy. Robustness tests, including counterfactual perturbations, background sensitivity analysis, and proxy-variable detection, confirm the resilience of adaptive explanations under real-world drift conditions. These findings establish adaptive explainability as a practical mechanism for sustaining transparency, accountability, and ethical reliability in data-driven credit systems, and more broadly, in any domain where decision models evolve with population change.
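
Variant (B), drift-aware rebaselining, hinges on one mechanism: the SHAP background set is a sliding window over recent observations rather than a fixed training sample. A minimal sketch of that mechanism (the class name and window size are invented; a real pipeline would hand `window` to a SHAP explainer as its background data):

```python
from collections import deque
import numpy as np

class SlidingBackground:
    """Drift-aware background set: explanations are rebaselined against the
    most recent `maxlen` observations instead of a static training sample."""
    def __init__(self, maxlen=100):
        self.window = deque(maxlen=maxlen)   # old points fall out automatically

    def update(self, x):
        self.window.append(np.asarray(x, dtype=float))

    def baseline(self):
        """Proxy for E[f(X)] under the *current* data distribution."""
        return np.mean(list(self.window), axis=0)

bg = SlidingBackground(maxlen=3)
for x in ([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [4.0, 4.0]):
    bg.update(x)
# The first point has been evicted; the baseline tracks the drifted mean.
```

Under concept drift, a static background keeps explaining predictions relative to a population that no longer exists; the sliding window is what restores a meaningful reference point.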

[250] Optimizing Reasoning Efficiency through Prompt Difficulty Prediction

Bo Zhao, Berkcan Kapusuzoglu, Kartik Balasubramaniam, Sambit Sahu, Supriyo Chakraborty, Genta Indra Winata

Main category: cs.LG

TL;DR: Proposes a routing system that assigns problems to the smallest capable reasoning model, reducing compute costs while maintaining accuracy.

DetailsMotivation: Reasoning language models are effective but expensive to deploy due to their large size and long reasoning processes, creating a need for cost-efficient deployment methods.

Method: Uses intermediate representations from s1.1-32B model to train lightweight predictors that estimate problem difficulty or model correctness, then routes problems to appropriate smaller models in a pool.

Result: On diverse math benchmarks, routing improves efficiency over random assignment and matches s1.1-32B’s performance while using significantly less compute.

Conclusion: Difficulty-aware routing is an effective approach for cost-efficient deployment of reasoning models without sacrificing accuracy.

Abstract: Reasoning language models perform well on complex tasks but are costly to deploy due to their size and long reasoning traces. We propose a routing approach that assigns each problem to the smallest model likely to solve it, reducing compute without sacrificing accuracy. Using intermediate representations from s1.1-32B, we train lightweight predictors of problem difficulty or model correctness to guide routing across a pool of reasoning models. On diverse math benchmarks, routing improves efficiency over random assignment and matches s1.1-32B’s performance while using significantly less compute. Our results demonstrate that difficulty-aware routing is effective for cost-efficient deployment of reasoning models.
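
The routing rule itself is simple: walk the model pool from smallest to largest and stop at the first model the predictor expects to solve the problem. A hypothetical sketch (the paper's predictors are trained on s1.1-32B representations; the toy difficulty scores and 0.5 threshold below are invented):

```python
def route(difficulty, predictors, models):
    """Send the problem to the smallest model predicted to solve it;
    fall back to the largest. `predictors[i](d)` estimates P(model i correct)."""
    for predict_ok, model in zip(predictors, models):  # ordered small -> large
        if predict_ok(difficulty) >= 0.5:
            return model
    return models[-1]

models = ["1.5B", "7B", "32B"]
# Toy correctness predictors over a difficulty score in [0, 1].
predictors = [lambda d: 1.0 - d,    # small model: only easy problems
              lambda d: 1.2 - d,    # medium model
              lambda d: 1.0]        # largest: assumed always capable
easy = route(0.2, predictors, models)
hard = route(0.9, predictors, models)
```

Compute savings come from the easy cases never reaching the large model, while accuracy is preserved because hard cases still escalate to it.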

[251] One Size Does Not Fit All: Architecture-Aware Adaptive Batch Scheduling with DEBA

François Belias, Naser Ezzati-Jivan, Foutse Khomh

Main category: cs.LG

TL;DR: DEBA is an adaptive batch scheduler that monitors gradient variance, gradient norm variation, and loss variation to dynamically adjust batch sizes. The study shows that architecture fundamentally determines adaptation efficacy, with lightweight models achieving 45-62% speedup and accuracy improvements, while deeper networks show high variance.

DetailsMotivation: Existing adaptive batch size methods apply identical strategies across all architectures, assuming a one-size-fits-all solution, but this work challenges that assumption by showing architecture-specific adaptation benefits.

Method: DEBA monitors gradient variance, gradient norm variation, and loss variation to guide batch size adaptations. Uses sliding window statistics and sufficient cooldown periods between adaptations. Evaluated across six architectures on CIFAR-10/100 with five random seeds per configuration.

Result: Lightweight architectures achieved 45-62% training speedup with 1-7% accuracy improvements. ResNet-18 showed consistent gains (+2.4-4.0% accuracy, 36-43% speedup), while ResNet-50 exhibited high variance. ViT-B16 showed minimal speedup (6%). Introduced baseline characterization framework using gradient stability metrics.

Conclusion: Batch size adaptation requires architecture-aware design, challenging the prevailing assumption that adaptive methods generalize across architectures. Architecture fundamentally determines adaptation efficacy.

Abstract: Adaptive batch size methods aim to accelerate neural network training, but existing approaches apply identical adaptation strategies across all architectures, assuming a one-size-fits-all solution. We introduce DEBA (Dynamic Efficient Batch Adaptation), an adaptive batch scheduler that monitors gradient variance, gradient norm variation and loss variation to guide batch size adaptations. Through systematic evaluation across six architectures (ResNet-18/50, DenseNet-121, EfficientNet-B0, MobileNet-V3, ViT-B16) on CIFAR-10 and CIFAR-100, with five random seeds per configuration, we demonstrate that the architecture fundamentally determines adaptation efficacy. Our findings reveal that: (1) lightweight and medium-depth architectures (MobileNet-V3, DenseNet-121, EfficientNet-B0) achieve a 45-62% training speedup with simultaneous accuracy improvements of 1-7%; (2) shallow residual networks (ResNet-18) show consistent gains of +2.4-4.0% in accuracy and 36-43% in speedup, while deep residual networks (ResNet-50) exhibit high variance and occasional degradation; (3) already-stable architectures (ViT-B16) show minimal speedup (6%) despite maintaining accuracy, indicating that adaptation benefits vary with baseline optimization characteristics. We introduce a baseline characterization framework using gradient stability metrics (stability score, gradient norm variation) that predicts which architectures will benefit from adaptive scheduling. Our ablation studies reveal critical design choices often overlooked in prior work: sliding window statistics (vs. full history) and sufficient cooldown periods (5+ epochs) between adaptations are essential for success. This work challenges the prevailing assumption that adaptive methods generalize across architectures and provides the first systematic evidence that batch size adaptation requires an architecture-aware design.
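
The two design choices the ablations single out, sliding-window statistics and a cooldown between adaptations, can be sketched in a toy scheduler. This is a hypothetical minimal version (thresholds, doubling rule, and the single gradient-variance signal are invented; DEBA also monitors gradient norm variation and loss variation):

```python
from collections import deque

class BatchScheduler:
    """Toy DEBA-style scheduler: grow the batch when the sliding-window
    gradient variance is low, with a cooldown between adaptations."""
    def __init__(self, batch=32, window=5, var_thresh=0.01, cooldown=5,
                 max_batch=512):
        self.batch, self.max_batch = batch, max_batch
        self.history = deque(maxlen=window)   # sliding window, not full history
        self.var_thresh, self.cooldown = var_thresh, cooldown
        self.since_change = cooldown          # allow an adaptation right away

    def step(self, grad_norm):
        self.history.append(grad_norm)
        self.since_change += 1
        if len(self.history) < self.history.maxlen or self.since_change < self.cooldown:
            return self.batch
        mean = sum(self.history) / len(self.history)
        var = sum((g - mean) ** 2 for g in self.history) / len(self.history)
        if var < self.var_thresh and self.batch < self.max_batch:
            self.batch = min(self.batch * 2, self.max_batch)
            self.since_change = 0             # start the cooldown
        return self.batch

sched = BatchScheduler(batch=32, window=3, var_thresh=0.01, cooldown=3)
sizes = [sched.step(1.0) for _ in range(8)]   # perfectly stable gradients
```

With perfectly stable gradients the batch doubles, waits out the cooldown, then doubles again; a noisy gradient stream would keep the variance above threshold and freeze the batch size, which is the architecture-dependent behavior the paper measures.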

[252] Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems

Utkarsh U. Chavan, Prashant Trivedi, Nandyala Hemachandra

Main category: cs.LG

TL;DR: This paper studies decentralized multi-agent stochastic shortest path problems under linear function approximation, establishing the first regret lower bound of Ω(√K) for this setting.

DetailsMotivation: Multi-agent systems are crucial for applications like swarm robotics and traffic routing, but the decentralized multi-agent variant of stochastic shortest path problems remains largely unexplored despite extensive single-agent research.

Method: The authors study decentralized multi-agent SSPs under linear function approximation, using novel symmetry-based arguments to identify optimal policy structures and constructing hard-to-learn instances.

Result: The main contribution is the first regret lower bound of Ω(√K) for decentralized multi-agent SSPs, which holds for any number of agents and highlights the inherent learning difficulty in this setting.

Conclusion: The established regret lower bound clarifies the learning complexity of decentralized control and provides guidance for designing efficient learning algorithms in multi-agent systems.

Abstract: Multi-agent systems (MAS) are central to applications such as swarm robotics and traffic routing, where agents must coordinate in a decentralized manner to achieve a common objective. Stochastic Shortest Path (SSP) problems provide a natural framework for modeling decentralized control in such settings. While the problem of learning in SSP has been extensively studied in single-agent settings, the decentralized multi-agent variant remains largely unexplored. In this work, we take a step towards addressing that gap. We study decentralized multi-agent SSPs (Dec-MASSPs) under linear function approximation, where the transition dynamics and costs are represented using linear models. Applying novel symmetry-based arguments, we identify the structure of optimal policies. Our main contribution is the first regret lower bound for this setting based on the construction of hard-to-learn instances for any number of agents, $n$. Our regret lower bound of $\Omega(\sqrt{K})$, over $K$ episodes, highlights the inherent learning difficulty in Dec-MASSPs. These insights clarify the learning complexity of decentralized control and can further guide the design of efficient learning algorithms in multi-agent systems.
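
For concreteness, the quantity being lower-bounded can be written in the standard episodic form (an assumed notation, not necessarily the paper's: $C^{\pi}(s_{\mathrm{init}})$ denotes the expected cumulative cost of policy $\pi$ from the initial state, and $\pi_k$ is the joint policy the agents play in episode $k$):

```latex
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( C^{\pi_k}(s_{\mathrm{init}}) - C^{\pi^{*}}(s_{\mathrm{init}}) \right),
\qquad
\sup_{\text{Dec-MASSP instances}} \; \mathbb{E}\!\left[\mathrm{Regret}(K)\right] \;=\; \Omega\!\left(\sqrt{K}\right)
\;\; \text{for any learning algorithm.}
```

The supremum over hard instances is what makes this a lower bound on the problem rather than on any particular algorithm: no decentralized learner can beat $\sqrt{K}$ growth on every instance.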

[253] Sketch-Augmented Features Improve Learning Long-Range Dependencies in Graph Neural Networks

Ryien Hosseini, Filippo Simini, Venkatram Vishwanath, Rebecca Willett, Henry Hoffmann

Main category: cs.LG

TL;DR: Injecting randomized global embeddings (Sketched Random Features) into GNNs to address oversquashing, oversmoothing, and limited expressive power issues.

DetailsMotivation: Standard GNNs face three key challenges: oversquashing of long-range information, oversmoothing of node representations, and limited expressive power due to their local message passing paradigm.

Method: Injecting sketched random features - randomized global embeddings of node features that are unique, distance-sensitive, and topology-agnostic - into standard GNN architectures.

Result: Experimental results show consistent performance improvements over baseline GNNs on real-world graph learning tasks, effectively capturing long-range dependencies.

Conclusion: Sketched Random Features offer both a standalone solution and complementary enhancement to existing techniques like graph positional encodings for improving GNN performance.

Abstract: Graph Neural Networks learn on graph-structured data by iteratively aggregating local neighborhood information. While this local message passing paradigm imparts a powerful inductive bias and exploits graph sparsity, it also yields three key challenges: (i) oversquashing of long-range information, (ii) oversmoothing of node representations, and (iii) limited expressive power. In this work we inject randomized global embeddings of node features, which we term Sketched Random Features, into standard GNNs, enabling them to efficiently capture long-range dependencies. The embeddings are unique, distance-sensitive, and topology-agnostic – properties which we analytically and empirically show alleviate the aforementioned limitations when injected into GNNs. Experimental results on real-world graph learning tasks confirm that this strategy consistently improves performance over baseline GNNs, offering both a standalone solution and a complementary enhancement to existing techniques such as graph positional encodings. Our source code is available at https://github.com/ryienh/sketched-random-features.
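
A Johnson-Lindenstrauss-style random projection is one simple way to realize the three stated properties, and serves as a hypothetical stand-in for the paper's sketching construction (the projection choice and dimension are assumptions for illustration): each node gets a compact embedding that is distance-sensitive (inner products are approximately preserved) and topology-agnostic (no adjacency is used).

```python
import numpy as np

rng = np.random.default_rng(0)

def sketched_features(X, dim=64):
    """Random projection of node features into `dim` dimensions.
    Rows of X are per-node features; the result can be concatenated with
    node features before message passing."""
    n, d = X.shape
    R = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(d, dim))  # JL-style scaling
    return X @ R

X = rng.normal(size=(10, 128))   # 10 nodes, 128 raw features
Z = sketched_features(X, dim=64)
```

Because the sketch is computed once, globally, and independently of the graph, injecting it lets a node "see" distant nodes' content without the information being squashed through many message-passing hops.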

[254] From Static to Dynamic: Enhancing Offline-to-Online Reinforcement Learning via Energy-Guided Diffusion Stratification

Lipeng Zu, Hansong Zhou, Xiaonan Zhang

Main category: cs.LG

TL;DR: StratDiff uses diffusion models and energy-based functions to stratify samples into offline-like and online-like categories during offline-to-online RL transitions, improving performance and stability.

DetailsMotivation: Address distributional shifts between offline datasets and evolving online policies in RL, and leverage the distributional structure of offline data for better adaptation.

Method: Energy-Guided Diffusion Stratification (StratDiff) uses diffusion models to learn from offline data, refines with energy functions, computes KL divergence to stratify samples, and applies different learning strategies to each subset.

Result: Significantly outperforms existing methods on D4RL benchmarks, achieving enhanced adaptability and more stable performance across diverse RL settings.

Conclusion: StratDiff effectively bridges offline-to-online RL transitions by leveraging data distribution structure and stratified learning strategies.

Abstract: Transitioning from offline to online reinforcement learning (RL) poses critical challenges due to distributional shifts between the fixed behavior policy in the offline dataset and the evolving policy during online learning. Although this issue is widely recognized, few methods attempt to explicitly assess or utilize the distributional structure of the offline data itself, leaving a research gap in adapting learning strategies to different types of samples. To address this challenge, we propose an innovative method, Energy-Guided Diffusion Stratification (StratDiff), which facilitates smoother transitions in offline-to-online RL. StratDiff deploys a diffusion model to learn prior knowledge from the offline dataset. It then refines this knowledge through energy-based functions to improve policy imitation and generate offline-like actions during online fine-tuning. The KL divergence between the generated action and the corresponding sampled action is computed for each sample and used to stratify the training batch into offline-like and online-like subsets. Offline-like samples are updated using offline objectives, while online-like samples follow online learning strategies. We demonstrate the effectiveness of StratDiff by integrating it with off-the-shelf methods Cal-QL and IQL. Extensive empirical evaluations on D4RL benchmarks show that StratDiff significantly outperforms existing methods, achieving enhanced adaptability and more stable performance across diverse RL settings.
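
The stratification step reduces to a per-sample KL score and a threshold split. A hypothetical sketch using the closed-form KL between diagonal Gaussians (the paper's exact divergence computation between diffusion-generated and replayed actions may differ; the threshold below is invented):

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(N_p || N_q) for diagonal Gaussians, in closed form."""
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def stratify(kl_scores, threshold):
    """Low divergence from the generated offline-like action -> offline-like
    sample (offline objective); high divergence -> online-like sample."""
    kl = np.asarray(kl_scores)
    return np.where(kl <= threshold)[0], np.where(kl > threshold)[0]

# Per-sample KL between generated and replayed action distributions.
kl = [gaussian_kl([0.0], [1.0], [0.0], [1.0]),   # identical      -> 0.0
      gaussian_kl([2.0], [1.0], [0.0], [1.0]),   # far apart      -> 2.0
      gaussian_kl([0.1], [1.0], [0.0], [1.0]),   # close          -> 0.005
      gaussian_kl([3.0], [1.0], [0.0], [1.0])]   # far apart      -> 4.5
offline_like, online_like = stratify(kl, threshold=0.5)
```

Each subset then gets its own update rule: offline-like indices follow the offline objective (e.g. Cal-QL's conservative loss), online-like indices follow the online one.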

[255] Higher-Order Causal Structure Learning with Additive Models

James Enouen, Yujia Zheng, Ignavier Ng, Yan Liu, Kun Zhang

Main category: cs.LG

TL;DR: Extends causal additive models to include higher-order interactions using directed acyclic hypergraphs, provides identifiability results, and develops a greedy algorithm for learning hyper DAGs.

DetailsMotivation: Real-world processes often exhibit higher-order mechanisms, but causal discovery has largely ignored explicit treatment of interactions beyond pairwise relationships.

Method: Extends CAM to additive models with higher-order interactions, introduces directed acyclic hypergraphs, provides theoretical framework and identifiability results, and develops a greedy algorithm for hyper DAG learning.

Result: Shows that learning hypergraph structure can lead to better empirical results, with more restrictive assumptions (like CAM) corresponding to easier-to-learn hyper DAGs and better finite sample complexity.

Conclusion: The proposed extension of CAM to handle higher-order interactions via hyper DAGs is theoretically sound and empirically useful, with the greedy algorithm demonstrating effectiveness in synthetic experiments.

Abstract: Causal structure learning has long been the central task of inferring causal insights from data. Despite the abundance of real-world processes exhibiting higher-order mechanisms, however, an explicit treatment of interactions in causal discovery has received little attention. In this work, we focus on extending the causal additive model (CAM) to additive models with higher-order interactions. This second level of modularity we introduce to the structure learning problem is most easily represented by a directed acyclic hypergraph which extends the DAG. We introduce the necessary definitions and theoretical tools to handle the novel structure we introduce and then provide identifiability results for the hyper DAG, extending the typical Markov equivalence classes. We next provide insights into why learning the more complex hypergraph structure may actually lead to better empirical results. In particular, more restrictive assumptions like CAM correspond to easier-to-learn hyper DAGs and better finite sample complexity. We finally develop an extension of the greedy CAM algorithm which can handle the more complex hyper DAG search space and demonstrate its empirical usefulness in synthetic experiments.
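
To make the "higher-order mechanism" concrete, here is an invented additive structural equation with a second-order term (the functional forms are illustrative, not from the paper): a plain CAM writes x3 = f1(x1) + f2(x2) + noise, while a hyper DAG additionally attaches a term to the hyperedge {x1, x2}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
f12 = 0.8 * x1 * x2                                  # hyperedge {x1, x2} term
x3 = np.sin(x1) + x2 ** 2 + f12 + 0.1 * rng.normal(size=n)

# The interaction is invisible to marginal additive fits: x1*x2 is (nearly)
# uncorrelated with x1 alone, yet contributes substantially to x3's variance.
corr_with_x1 = np.corrcoef(f12, x1)[0, 1]
```

A CAM-style learner would absorb `f12` into noise; representing the hyperedge explicitly is what the extended hyper DAG structure buys.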

[256] Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction

Lipeng Zu, Hansong Zhou, Xiaonan Zhang

Main category: cs.LG

TL;DR: SADQ improves DQN by modeling environment dynamics with successor-state distributions, reducing variance and improving policy alignment in Q-value updates.

DetailsMotivation: DQN's target updates use next states from potentially suboptimal past policies, causing high variance and poor learning signals when transitions don't align with current policy.

Method: Proposes Successor-state Aggregation Deep Q-Network (SADQ) that explicitly models environment dynamics using stochastic transition model and integrates successor-state distributions into Q-value estimation.

Result: Theoretical guarantees show SADQ maintains unbiased value estimates while reducing training variance. Empirical results demonstrate consistent outperformance over DQN variants in stability and learning efficiency across RL benchmarks and real-world control tasks.

Conclusion: SADQ provides more stable and policy-aligned value updates through explicit environment dynamics modeling, addressing limitations of standard DQN approaches.

Abstract: Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer. However, the target updates in DQN often rely on next states generated by actions from a past, potentially suboptimal policy. As a result, these states may not provide informative learning signals, introducing high variance into the update process. This issue is exacerbated when the sampled transitions are poorly aligned with the agent’s current policy. To address this limitation, we propose the Successor-state Aggregation Deep Q-Network (SADQ), which explicitly models environment dynamics using a stochastic transition model. SADQ integrates successor-state distributions into the Q-value estimation process, enabling more stable and policy-aligned value updates. Additionally, it explores a more efficient action selection strategy with the modeled transition structure. We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance. Our extensive empirical results across standard RL benchmarks and real-world vector-based control tasks demonstrate that SADQ consistently outperforms DQN variants in both stability and learning efficiency.
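
The core change to the Bellman target can be shown in a tabular toy (a hypothetical simplification: SADQ operates on deep networks with a learned transition model, whereas the successor distribution here is given): instead of bootstrapping from the single replayed next state, average the bootstrap over the modeled successor-state distribution.

```python
import numpy as np

def sadq_target(r, gamma, succ_probs, succ_states, Q):
    """Expected Bellman target over a successor-state distribution,
    instead of the single replayed next state used by vanilla DQN."""
    return r + gamma * sum(p * np.max(Q[s])
                           for p, s in zip(succ_probs, succ_states))

Q = np.array([[0.0, 1.0],    # state 0 action-values
              [2.0, 0.5],    # state 1
              [0.0, 0.0]])   # state 2

# r=1, gamma=0.9, successor distribution: state 0 w.p. 0.7, state 1 w.p. 0.3
target = sadq_target(1.0, 0.9, [0.7, 0.3], [0, 1], Q)
# = 1 + 0.9 * (0.7 * 1.0 + 0.3 * 2.0) = 2.17
```

Averaging over successors removes the dependence on which single next state happened to be logged by the old behavior policy, which is where the variance reduction comes from.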

[257] Rater Equivalence: Evaluating Classifiers in Human Judgment Settings

Paul Resnick, Yuqing Kong, Grant Schoenebeck, Tim Weninger

Main category: cs.LG

TL;DR: A framework for evaluating classifiers using human judgments instead of ground truth, measuring performance through rater equivalence - the minimum number of human raters needed to match classifier performance.

DetailsMotivation: In many decision settings, ground truth is unavailable or non-existent, making traditional evaluation methods impossible. There's a need to compare automated classifiers to human judgment when definitive truth is inaccessible.

Method: The framework uses human-generated labels to construct benchmark panels and evaluate performance. It quantifies classifier performance through rater equivalence and distinguishes between two utility models: agreement with inaccessible ground truth and matching individual human judgments.

Result: Through case studies and formal analysis, the framework demonstrates practical utility for evaluating and deploying AI systems in real-world scenarios where ground truth is unavailable.

Conclusion: The proposed framework provides a practical approach for classifier evaluation in settings without definitive ground truth, enabling meaningful comparison between automated systems and human judgment through the rater equivalence metric.

Abstract: In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments. We quantify a classifier’s performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier’s performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
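
Under a strong simplifying assumption not made by the paper (independent raters with a common per-item accuracy, binary labels, majority vote as the combination rule), rater equivalence reduces to a small search:

```python
from math import comb

def majority_accuracy(p, k):
    """Accuracy of a majority vote of k independent raters, each correct
    with probability p (k odd, so there are no ties)."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

def rater_equivalence(classifier_acc, rater_acc, max_k=99):
    """Smallest odd panel size whose majority vote matches the classifier."""
    for k in range(1, max_k + 1, 2):
        if majority_accuracy(rater_acc, k) >= classifier_acc:
            return k
    return None

# A 90%-accurate classifier vs. 70%-accurate raters: how many raters match it?
panel = rater_equivalence(0.9, 0.7)
```

The paper's framework is more general (it builds benchmark panels from real human labels rather than assuming independence), but the sketch shows why the metric is interpretable: "this classifier is worth nine such raters" is a deployment-relevant statement.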

[258] Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

Kimia Kazemian, Zhenzhen Liu, Yangfanyu Yang, Katie Z Luo, Shuhan Gu, Audrey Du, Xinyu Yang, Jack Jansons, Kilian Q Weinberger, John Thickstun, Yian Yin, Sarah Dean

Main category: cs.LG

TL;DR: The paper introduces Lead-Lag Forecasting (LLF) as a new time-series forecasting paradigm for predicting delayed outcomes from early interactions, presents two benchmark datasets (arXiv and GitHub), and establishes foundations for systematic research in this area.

DetailsMotivation: To address the lack of standardized datasets and unified treatment for forecasting problems where early interactions (like views, likes) predict temporally shifted outcomes (like citations, sales) in social and collaborative platforms.

Method: Created two high-volume benchmark datasets (arXiv with 2.3M papers and GitHub with 3M repositories), documented data curation and cleaning processes, verified lead-lag dynamics through statistical tests, and benchmarked parametric and non-parametric baselines for regression.

Result: Established LLF as a novel forecasting paradigm, provided standardized datasets capturing long-horizon dynamics across years, and demonstrated the presence of lead-lag relationships through empirical verification.

Conclusion: The study successfully establishes LLF as a new forecasting problem domain and provides an empirical foundation with benchmark datasets for systematic exploration of lead-lag forecasting in social and usage data.

Abstract: Social and collaborative platforms emit multivariate time-series traces in which early interactions-such as views, likes, or downloads-are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets-arXiv (accesses -> citations of 2.3M papers) and GitHub (pushes/stars -> forks of 3M repositories)-and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views -> edits), Spotify (streams -> concert attendance), e-commerce (click-throughs -> purchases), and LinkedIn profile (views -> messages). Our datasets provide ideal testbeds for lead-lag forecasting by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We documented all technical details of data curation and cleaning, verified the presence of lead-lag dynamics through statistical and classification tests, and benchmarked parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. Our data portal with downloads and documentation is available at https://lead-lag-forecasting.github.io/.
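
The simplest parametric baseline for the LLF regression task is a least-squares map from the early lead channel to the delayed outcome. A hypothetical sketch on synthetic data (the 0.3 lead-lag coefficient and Poisson access counts are invented; real arXiv dynamics are far messier):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_lead_lag(lead_early, lag_total):
    """Least-squares slope/intercept from early lead counts to the lagged
    outcome -- the kind of parametric baseline benchmarked in the paper."""
    A = np.column_stack([lead_early, np.ones(len(lead_early))])
    coef, *_ = np.linalg.lstsq(A, lag_total, rcond=None)
    return coef

# Toy corpus: first-year accesses (lead) -> eventual citations (lag).
accesses = rng.poisson(100, size=200).astype(float)
citations = 0.3 * accesses + rng.normal(0.0, 1.0, size=200)
slope, intercept = fit_lead_lag(accesses, citations)
```

The benchmark's value lies in how far real data departs from this: long horizons, heavy tails, and drift are what make LLF a distinct forecasting problem rather than a regression exercise.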

[259] DecoHD: Decomposed Hyperdimensional Classification under Extreme Memory Budgets

Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Mohsen Imani

Main category: cs.LG

TL;DR: Decomposition-based compression method for hyperdimensional computing that achieves extreme memory savings while maintaining accuracy and improving robustness.

DetailsMotivation: Traditional HDC compression methods shrink the feature axis, which erodes concentration and robustness. Prior decompositions use fixed atomic hypervectors that are ill-suited for compressing learned class prototypes.

Method: DecoHD learns directly in a decomposed HDC parameterization using a small shared set of per-layer channels with multiplicative binding across layers and bundling at the end. It compresses along the class axis via a lightweight bundling head while preserving native bind-bundle-score operations.

Result: Achieves extreme memory savings with only minor accuracy degradation (within 0.1-0.15% of baseline, worst case 5.7%), more robust to random bit-flip noise, reaches accuracy plateau with ~97% fewer trainable parameters, and delivers significant energy/speed gains over CPU, GPU, and baseline HDC ASIC.

Conclusion: DecoHD enables efficient HDC compression that maintains accuracy and robustness while achieving substantial hardware benefits, making it suitable for tight deployment budgets and in/near-memory accelerators.

Abstract: Decomposition is a proven way to shrink deep networks without changing I/O. We bring this idea to hyperdimensional computing (HDC), where footprint cuts usually shrink the feature axis and erode concentration and robustness. Prior HDC decompositions decode via fixed atomic hypervectors, which are ill-suited for compressing learned class prototypes. We introduce DecoHD, which learns directly in a decomposed HDC parameterization: a small, shared set of per-layer channels with multiplicative binding across layers and bundling at the end, yielding a large representational space from compact factors. DecoHD compresses along the class axis via a lightweight bundling head while preserving native bind-bundle-score; training is end-to-end, and inference remains pure HDC, aligning with in/near-memory accelerators. In evaluation, DecoHD attains extreme memory savings with only minor accuracy degradation under tight deployment budgets. On average it stays within about 0.1-0.15% of a strong non-reduced HDC baseline (worst case 5.7%), is more robust to random bit-flip noise, reaches its accuracy plateau with up to ~97% fewer trainable parameters, and – in hardware – delivers roughly 277x/35x energy/speed gains over a CPU (AMD Ryzen 9 9950X), 13.5x/3.7x over a GPU (NVIDIA RTX 4090), and 2.0x/2.4x over a baseline HDC ASIC.
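
The native bind-bundle-score operations and the decomposed parameterization can be illustrated with bipolar hypervectors. This is a hypothetical toy (DecoHD learns its channels end-to-end and adds a bundling head along the class axis; here the channels are random and the composition is hand-picked):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048                                   # hypervector dimensionality (illustrative)

def bind(a, b):   return a * b             # multiplicative binding (bipolar)
def bundle(vs):   return np.sign(np.sum(vs, axis=0))
def score(q, c):  return float(q @ c) / D  # normalized dot-product similarity

# Decomposed storage: 2 layers x 3 shared channels = 6 stored vectors, yet
# binding across layers spans 3 x 3 = 9 composable codes.
channels = rng.choice([-1, 1], size=(2, 3, D))

def proto(pairs):
    """Class prototype = bundle of channel pairs bound across layers."""
    return bundle([bind(channels[0][i], channels[1][j]) for (i, j) in pairs])

classes = [proto([(0, 0), (0, 1)]), proto([(1, 2)]), proto([(2, 1), (2, 2)])]
query = bind(channels[0][1], channels[1][2])      # encodes exactly class 1
best = int(np.argmax([score(query, c) for c in classes]))
```

Classification stays pure HDC (bind, bundle, score), which is why the compressed model still maps onto in/near-memory accelerators: the memory saving comes from storing the small shared channel set instead of one dense prototype per class.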

[260] On Predicting Sociodemographics from Mobility Signals

Ekin Uğurel, Cynthia Chen, Brian H. Y. Lee, Filipe Rodrigues

Main category: cs.LG

TL;DR: The paper develops a framework for predicting sociodemographic attributes from mobility data using interpretable higher-order features, uncertainty quantification tools, and multitask learning to improve accuracy and generalization.

DetailsMotivation: Inferring sociodemographic attributes from mobility data is challenging due to weak relationships between mobility patterns and sociodemographic traits, and limited generalization across contexts.

Method: Three approaches: 1) Behaviorally grounded higher-order mobility descriptors based on directed mobility graphs capturing trip sequences, travel modes, and social co-travel; 2) Metrics and visual diagnostic tools for uncertainty quantification; 3) Multitask learning framework that jointly predicts multiple sociodemographic attributes from a shared representation.

Result: The higher-order features significantly improve prediction of age, gender, income, and household structure over baseline features. The multitask learning approach outperforms single-task models, especially with limited training data or when test set distribution differs from training set.

Conclusion: The proposed framework addresses key challenges in sociodemographic inference from mobility data by improving predictive accuracy, interpretability, uncertainty quantification, and generalization across contexts.

Abstract: Inferring sociodemographic attributes from mobility data could help transportation planners better leverage passively collected datasets, but this task remains difficult due to weak and inconsistent relationships between mobility patterns and sociodemographic traits, as well as limited generalization across contexts. We address these challenges from three angles. First, to improve predictive accuracy while retaining interpretability, we introduce a behaviorally grounded set of higher-order mobility descriptors based on directed mobility graphs. These features capture structured patterns in trip sequences, travel modes, and social co-travel, and significantly improve prediction of age, gender, income, and household structure over baseline features. Second, we introduce metrics and visual diagnostic tools that encourage evenness between model confidence and accuracy, enabling planners to quantify uncertainty. Third, to improve generalization and sample efficiency, we develop a multitask learning framework that jointly predicts multiple sociodemographic attributes from a shared representation. This approach outperforms single-task models, particularly when training data are limited or when applying models across different time periods (i.e., when the test set distribution differs from the training set).
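As a toy illustration of what a descriptor over a directed mobility graph might look like (the paper's actual feature set is far richer; this bigram counter is our own simplification):

```python
from collections import Counter

def trip_bigrams(locations):
    """Directed-edge counts over one traveler's location sequence:
    the simplest higher-order descriptor on a directed mobility graph."""
    return Counter(zip(locations, locations[1:]))

trips = ["home", "work", "gym", "home", "work"]
edges = trip_bigrams(trips)  # e.g. the home->work edge occurs twice
```

Edge multiplicities like these, extended with travel modes and co-travel labels, are the kind of structured, interpretable inputs the framework feeds to its predictors.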

[261] SynQuE: Estimating Synthetic Dataset Quality Without Annotations

Arthur Chen, Victor Zhong

Main category: cs.LG

TL;DR: SynQuE framework ranks synthetic datasets by expected real-world task performance using limited unannotated real data, addressing data scarcity challenges. LENS proxy metric outperforms others on complex tasks by leveraging LLM reasoning.

DetailsMotivation: Address critical challenge of data scarcity due to collection costs or privacy constraints by enabling effective selection of synthetic datasets for training when real data is limited.

Method: Introduce proxy metrics including distribution/diversity-based distance measures via embeddings, and propose LENS - a novel proxy that leverages large language model reasoning for complex planning tasks.

Result: SynQuE proxies correlate with real task performance across diverse tasks. On text-to-SQL parsing, training on top-3 synthetic datasets selected via SynQuE raises accuracy from 30.4% to 38.4% (+8.1%) compared to indiscriminate selection.

Conclusion: Establishes SynQuE as practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation model-based data characterization and fine-grained data selection.

Abstract: We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical and open challenge where data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem by introducing and evaluating proxy metrics that choose synthetic data for training to maximize task performance on real data. We introduce the first proxy metrics for SynQuE by adapting distribution and diversity-based distance measures to our context via embedding models. To address the shortcomings of these metrics on complex planning tasks, we propose LENS, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with LENS consistently outperforming others on complex tasks by capturing nuanced characteristics. For instance, on text-to-SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies can raise accuracy from 30.4% to 38.4% (+8.1%) on average compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation model-based data characterization and fine-grained data selection.
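A minimal sketch of the embedding-based distance idea, assuming a linear-kernel mean-embedding distance as the proxy (the paper's metrics and the LENS proxy are richer; names and dimensions here are illustrative):

```python
import numpy as np

def mean_embedding_distance(synth_emb, real_emb):
    """Distance between mean embeddings: a linear-kernel MMD-style surrogate."""
    return float(np.linalg.norm(synth_emb.mean(axis=0) - real_emb.mean(axis=0)))

def rank_synthetic_datasets(candidates, real_emb):
    """Rank candidates by proxy distance; smaller = higher expected task quality."""
    scored = [(name, mean_embedding_distance(emb, real_emb))
              for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(200, 16))         # unannotated real embeddings
candidates = {
    "close": rng.normal(0.0, 1.0, size=(200, 16)),  # matches real distribution
    "far": rng.normal(3.0, 1.0, size=(200, 16)),    # shifted distribution
}
ranking = rank_synthetic_datasets(candidates, real)
```

Note that the ranking requires no labels on the real data, which is the point of the SynQuE setting.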

[262] NVIDIA Nemotron Nano V2 VL

NVIDIA, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Guo Chen, Karan Sapra, Zhiding Yu, Adi Renduchintala, Charles Wang, Peter Jin, Arushi Goel, Mike Ranzinger, Lukas Voegtle, Philipp Fischer, Timo Roman, Wei Ping, Boxin Wang, Zhuolin Yang, Nayeon Lee, Shaokun Zhang, Fuxiao Liu, Zhiqi Li, Di Zhang, Greg Heinrich, Hongxu Yin, Song Han, Pavlo Molchanov, Parth Mannan, Yao Xu, Jane Polak Scowcroft, Tom Balough, Subhashree Radhakrishnan, Paris Zhang, Sean Cha, Ratnesh Kumar, Zaid Pervaiz Bhat, Jian Zhang, Darragh Hanley, Pritam Biswas, Jesse Oliver, Kevin Vasques, Roger Waleffe, Duncan Riach, Oluwatobi Olabiyi, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Pritam Gundecha, Khanh Nguyen, Alexandre Milesi, Eugene Khvedchenia, Ran Zilberstein, Ofri Masad, Natan Bagrov, Nave Assaf, Tomer Asida, Daniel Afrimi, Amit Zuker, Netanel Haber, Zhiyu Cheng, Jingyu Xin, Di Wu, Nik Spirin, Maryam Moosaei, Roman Ageev, Vanshil Atul Shah, Yuting Wu, Daniel Korzekwa, Unnikrishnan Kizhakkemadam Sreekumar, Wanli Jiang, Padmavathy Subramanian, Alejandra Rico, Sandip Bhaskar, Saeid Motiian, Kedi Wu, Annie Surla, Chia-Chih Chen, Hayden Wolff, Matthew Feinberg, Melissa Corpuz, Marek Wawrzos, Eileen Long, Aastha Jhunjhunwala, Paul Hendricks, Farzan Memarian, Benika Hall, Xin-Yu Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Krzysztof Pawelec, Michael Evans, Katherine Luna, Jie Lou, Erick Galinkin, Akshay Hazare, Kaustubh Purandare, Ann Guan, Anna Warno, Chen Cui, Yoshi Suhara, Shibani Likhite, Seph Mard, Meredith Price, Laya Sleiman, Saori Kaji, Udi Karpas, Kari Briski, Joey Conway, Michael Lightstone, Jan Kautz, Mohammad Shoeybi, Mostofa Patwary, Jonathen Cohen, Oleksii Kuchaiev, Andrew Tao, Bryan Catanzaro

Main category: cs.LG

TL;DR: Nemotron Nano V2 VL is an advanced vision-language model that improves document understanding, video comprehension, and reasoning through enhanced architecture, datasets, and training methods.

DetailsMotivation: To create a more efficient model for real-world document understanding and long video comprehension with better performance across vision and text domains.

Method: Built on Nemotron Nano V2 (hybrid Mamba-Transformer LLM) with innovative token reduction techniques for higher inference throughput in long documents and videos.

Result: Significant improvements over previous Llama-3.1-Nemotron-Nano-VL-8B model across all vision and text domains.

Conclusion: The model achieves enhanced performance and efficiency, with checkpoints released in multiple formats and open sharing of datasets, recipes, and training code.

Abstract: We introduce Nemotron Nano V2 VL, the latest model of the Nemotron vision-language series designed for strong real-world document understanding, long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and innovative token reduction techniques to achieve higher inference throughput in long document and video scenarios. We are releasing model checkpoints in BF16, FP8, and FP4 formats and sharing large parts of our datasets, recipes and training code.

[263] LogHD: Robust Compression of Hyperdimensional Classifiers via Logarithmic Class-Axis Reduction

Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Pietro Mercati, Nathaniel D. Bastian, Mohsen Imani

Main category: cs.LG

TL;DR: LogHD introduces logarithmic class-axis compression for hyperdimensional computing, reducing memory from O(CD) to O(D log C) while maintaining robustness and outperforming feature-axis compression methods.

DetailsMotivation: Standard hyperdimensional computing requires O(CD) memory which is inefficient for memory-constrained systems. Prior compaction methods reduce dimensionality but weaken robustness.

Method: LogHD replaces C per-class prototypes with n≈⌈log_k C⌉ bundle hypervectors using a capacity-aware codebook and profile-based decoding, while preserving dimensionality D.

Result: LogHD achieves competitive accuracy with smaller models, higher resilience at matched memory, and sustains target accuracy at 2.5-3.0× higher bit-flip rates than feature-axis compression. ASIC implementation shows 498× energy efficiency and 62.6× speedup over AMD Ryzen 9.

Conclusion: LogHD provides an effective logarithmic compression approach for hyperdimensional computing that maintains robustness while significantly reducing memory requirements and improving efficiency.

Abstract: Hyperdimensional computing (HDC) suits memory, energy, and reliability-constrained systems, yet the standard “one prototype per class” design requires $O(CD)$ memory (with $C$ classes and dimensionality $D$). Prior compaction reduces $D$ (feature axis), improving storage/compute but weakening robustness. We introduce LogHD, a logarithmic class-axis reduction that replaces the $C$ per-class prototypes with $n \approx \lceil\log_k C\rceil$ bundle hypervectors (alphabet size $k$) and decodes in an $n$-dimensional activation space, cutting memory to $O(D\log_k C)$ while preserving $D$. LogHD uses a capacity-aware codebook and profile-based decoding, and composes with feature-axis sparsification. Across datasets and injected bit flips, LogHD attains competitive accuracy with smaller models and higher resilience at matched memory. Under equal memory, it sustains target accuracy at roughly $2.5$-$3.0\times$ higher bit-flip rates than feature-axis compression; an ASIC instantiation delivers $498\times$ energy efficiency and $62.6\times$ speedup over an AMD Ryzen 9 9950X and $24.3\times$/$6.58\times$ over an NVIDIA RTX 4090, and is $4.06\times$ more energy-efficient and $2.19\times$ faster than a feature-axis HDC ASIC baseline.
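The memory accounting behind the class-axis reduction can be checked directly. This sketch (our own, with illustrative values of C, k, and D) assigns each class a base-k codeword of length n = ceil(log_k C), so n bundle hypervectors suffice to distinguish all C classes:

```python
import math

def codeword(c, k, n):
    """Length-n base-k codeword for class index c (unique for c < k**n)."""
    digits = []
    for _ in range(n):
        digits.append(c % k)
        c //= k
    return digits[::-1]

C, k, D = 100, 4, 10_000                 # classes, alphabet size, dimensionality
n = math.ceil(math.log(C, k))            # bundle hypervectors needed
mem_standard = C * D                     # O(C*D): one prototype per class
mem_loghd = n * D                        # O(D*log_k C): class-axis reduction
codewords = [tuple(codeword(c, k, n)) for c in range(C)]
```

For these values the class-axis storage shrinks 25x while dimensionality D, and hence the concentration properties it provides, is untouched.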

[264] RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods

Raghav Sharma, Manan Mehta, Sai Tiger Raina

Main category: cs.LG

TL;DR: This survey paper synthesizes the evolving landscape of RLHF alignment research, moving beyond text-based methods to address multi-modal alignment, cultural fairness, and low-latency optimization.

DetailsMotivation: To address critical gaps in current alignment research by exploring multi-modal alignment, cultural fairness, and low-latency optimization beyond traditional text-based RLHF methods.

Method: Systematic review of foundational algorithms (PPO, DPO, GRPO) followed by detailed analysis of latest innovations, with comparative synthesis of techniques.

Result: Provides a comprehensive synthesis of alignment techniques and identifies open challenges in the field.

Conclusion: The work serves as an essential roadmap for researchers building more robust, efficient, and equitable AI systems by outlining the new frontier of alignment research.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard for aligning Large Language Models (LLMs), yet recent progress has moved beyond canonical text-based methods. This survey synthesizes the new frontier of alignment research by addressing critical gaps in multi-modal alignment, cultural fairness, and low-latency optimization. To systematically explore these domains, we first review foundational algorithms, including PPO, DPO, and GRPO, before presenting a detailed analysis of the latest innovations. By providing a comparative synthesis of these techniques and outlining open challenges, this work serves as an essential roadmap for researchers building more robust, efficient, and equitable AI systems.
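Of the reviewed algorithms, DPO has the most compact objective; a minimal single-pair sketch (our simplification, taking sequence log-probabilities under the policy and a frozen reference model as inputs):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    margin = beta * ((logp_chosen - logp_rejected)
                     - (ref_logp_chosen - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is log(2) when policy and reference agree, and shrinks as the policy widens the chosen-vs-rejected gap relative to the reference, which is what removes the need for an explicit reward model.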

[265] Conditional Score Learning for Quickest Change Detection in Markov Transition Kernels

Wuxia Chen, Taposh Banerjee, Vahid Tarokh

Main category: cs.LG

TL;DR: The paper proposes a score-based CUSUM method for quickest change detection in high-dimensional Markov processes with unknown transition kernels, using conditional score learning instead of explicit likelihood evaluation.

DetailsMotivation: To address the challenge of change detection in Markov processes where transition kernels are unknown and data is high-dimensional, avoiding the need for explicit likelihood evaluation which is often impractical.

Method: Learn conditional scores ∇_y log p(y|x) directly from sample pairs, develop a score-based CUSUM procedure using conditional Hyvarinen score differences, and propose a truncated statistic to ensure bounded increments.

Result: Proved exponential lower bounds on mean time to false alarm using Hoeffding’s inequality for uniformly ergodic Markov processes, and established asymptotic upper bounds on detection delay.

Conclusion: The method provides both theoretical guarantees and practical feasibility for score-based change detection in high-dimensional Markov models with unknown transition dynamics.

Abstract: We address the problem of quickest change detection in Markov processes with unknown transition kernels. The key idea is to learn the conditional score $\nabla_{\mathbf{y}} \log p(\mathbf{y}|\mathbf{x})$ directly from sample pairs $( \mathbf{x},\mathbf{y})$, where both $\mathbf{x}$ and $\mathbf{y}$ are high-dimensional data generated by the same transition kernel. In this way, we avoid explicit likelihood evaluation and provide a practical way to learn the transition dynamics. Based on this estimation, we develop a score-based CUSUM procedure that uses conditional Hyvarinen score differences to detect changes in the kernel. To ensure bounded increments, we propose a truncated version of the statistic. With Hoeffding’s inequality for uniformly ergodic Markov processes, we prove exponential lower bounds on the mean time to false alarm. We also prove asymptotic upper bounds on detection delay. These results give both theoretical guarantees and practical feasibility for score-based detection in high-dimensional Markov models.
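The truncated score-based CUSUM recursion described above can be sketched as follows (the threshold, truncation level, and the stand-in score stream are illustrative; the paper's increments are conditional Hyvarinen score differences):

```python
def cusum_detect(score_diffs, threshold, trunc=5.0):
    """Score-based CUSUM with truncated increments:
    W_t = max(0, W_{t-1} + clip(s_t, -trunc, trunc)); alarm when W_t >= threshold."""
    w = 0.0
    for t, s in enumerate(score_diffs, start=1):
        w = max(0.0, w + max(-trunc, min(trunc, s)))
        if w >= threshold:
            return t  # alarm time
    return None       # no alarm raised

# Stand-in stream: negative drift pre-change, positive drift after t = 10.
stream = [-1.0] * 10 + [2.0] * 10
```

Truncating the increments is what bounds each step, enabling the Hoeffding-type argument behind the exponential false-alarm guarantees.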

[266] PrivacyCD: Hierarchical Unlearning for Protecting Student Privacy in Cognitive Diagnosis

Mingliang Hou, Yinuo Wang, Teng Guo, Zitao Liu, Wenzhou Dou, Jiaqi Zheng, Renqiang Luo, Mi Tian, Weiqi Luo

Main category: cs.LG

TL;DR: This paper introduces HIF, a novel data unlearning algorithm specifically designed for cognitive diagnosis models to address privacy concerns and users’ right to be forgotten.

DetailsMotivation: Growing privacy concerns and users' assertion of their 'right to be forgotten' require effective data removal mechanisms in cognitive diagnosis models, which currently lack privacy considerations.

Method: Proposes hierarchical importance-guided forgetting (HIF), which leverages layer-wise parameter importance characteristics through a smoothing mechanism combining individual and layer-level importance.

Result: Experiments on three real-world datasets show HIF significantly outperforms baselines on key metrics, effectively balancing unlearning completeness, model utility, and efficiency.

Conclusion: HIF provides the first effective solution for CD models to handle user data removal requests while maintaining performance, enabling privacy-preserving AI systems.

Abstract: The need to remove specific student data from cognitive diagnosis (CD) models has become a pressing requirement, driven by users’ growing assertion of their “right to be forgotten”. However, existing CD models are largely designed without privacy considerations and lack effective data unlearning mechanisms. Directly applying general-purpose unlearning algorithms is suboptimal, as they struggle to balance unlearning completeness, model utility, and efficiency when confronted with the unique heterogeneous structure of CD models. To address this, our paper presents the first systematic study of the data unlearning problem for CD models, proposing a novel and efficient algorithm: hierarchical importance-guided forgetting (HIF). Our key insight is that parameter importance in CD models exhibits distinct layer-wise characteristics. HIF leverages this via an innovative smoothing mechanism that combines individual and layer-level importance, enabling a more precise distinction of parameters associated with the data to be unlearned. Experiments on three real-world datasets show that HIF significantly outperforms baselines on key metrics, offering the first effective solution for CD models to respond to user data removal requests and for deploying high-performance, privacy-preserving AI systems.
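One way to read the smoothing mechanism is as a convex blend of per-parameter and layer-level importance. The sketch below is our guess at the shape of such a rule (the layer names, gradient-magnitude importance, and the blend coefficient alpha are illustrative, not the paper's exact formula):

```python
import numpy as np

def smoothed_importance(layer_grads, alpha=0.5):
    """Blend per-parameter importance (|g|) with its layer-level mean:
    a plausible reading of hierarchical importance smoothing."""
    scores = {}
    for layer, g in layer_grads.items():
        indiv = np.abs(g)
        scores[layer] = alpha * indiv + (1.0 - alpha) * indiv.mean()
    return scores

grads = {"student_emb": np.array([1.0, 3.0]),
         "interaction": np.array([0.5, 0.5])}
scores = smoothed_importance(grads)
# Parameters with the highest smoothed scores would be targeted for forgetting.
```

The layer-level term keeps a stray large gradient in an unimportant layer from dominating, which is the intuition behind exploiting layer-wise structure.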

[267] Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models

Semih Cayci

Main category: cs.LG

TL;DR: Analysis of stochastic Gauss-Newton method for training overparameterized deep neural networks, establishing finite-time convergence bounds and non-asymptotic generalization bounds using uniform stability.

DetailsMotivation: To understand how higher-order optimization methods affect generalization in deep learning, specifically examining stochastic Gauss-Newton with Levenberg-Marquardt damping for overparameterized networks.

Method: Variable-metric analysis in parameter space for convergence bounds, and uniform stability analysis for generalization bounds, considering batch size, network width, depth, and curvature effects.

Result: Established finite-time convergence bounds with explicit dependencies on batch size, network width and depth, and derived non-asymptotic generalization bounds showing that larger minimum eigenvalue of Gauss-Newton matrix yields tighter stability bounds.

Conclusion: Identified a favorable generalization regime for SGN where curvature properties (larger minimum eigenvalue of Gauss-Newton matrix) improve generalization performance through tighter stability bounds.

Abstract: An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.
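The damped Gauss-Newton update analyzed here is, per mini-batch, theta <- theta - (J^T J + lambda I)^{-1} J^T r. A minimal NumPy sketch on a linear least-squares problem, where one step with small damping nearly recovers the solution (problem sizes and damping value are illustrative):

```python
import numpy as np

def sgn_step(theta, jac, residual, damping=1e-2):
    """One Gauss-Newton step with Levenberg-Marquardt damping on a mini-batch:
    theta <- theta - (J^T J + lambda*I)^{-1} J^T r."""
    H = jac.T @ jac + damping * np.eye(theta.size)
    return theta - np.linalg.solve(H, jac.T @ residual)

# Linear regression demo: the residual is X @ theta - y and the Jacobian is X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star
theta0 = np.zeros(3)
theta1 = sgn_step(theta0, X, X @ theta0 - y, damping=1e-6)
```

For a neural network the Jacobian is that of the network outputs with respect to the parameters, and the damping lambda keeps the metric well-conditioned, which is central to both the convergence and stability analyses.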

[268] PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction

Xu Zou

Main category: cs.LG

TL;DR: PETRA is a transformer model that uses evolutionary trajectories from phylogenetic trees instead of raw RNA sequences to predict SARS-CoV-2 mutations, achieving significantly better performance than baselines.

DetailsMotivation: SARS-CoV-2's rapid evolution and immune-evasive variants challenge public health and vaccine development. Direct application of GPTs to noisy viral genomic sequences is limited.

Method: Uses evolutionary trajectories from phylogenetic trees rather than raw RNA sequences, with weighted training to address geographical and temporal data imbalances.

Result: Achieved weighted recall@1 of 9.45% for nucleotide mutations and 17.10% for spike amino-acid mutations, significantly outperforming baselines (0.49% and 6.64% respectively). Successfully predicted mutations for major clades like 24F(XEC) and 25A(LP.8.1).

Conclusion: PETRA effectively mitigates sequencing noise, captures viral evolution hierarchy, and demonstrates strong performance in real-time mutation prediction for SARS-CoV-2 variants.

Abstract: Since its emergence, SARS-CoV-2 has demonstrated a rapid and unpredictable evolutionary trajectory, characterized by the continual emergence of immune-evasive variants. This poses persistent challenges to public health and vaccine development. While large-scale generative pre-trained transformers (GPTs) have revolutionized the modeling of sequential data, their direct applications to noisy viral genomic sequences are limited. In this paper, we introduce PETRA (Pretrained Evolutionary TRAnsformer), a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees rather than raw RNA sequences. This method effectively mitigates sequencing noise and captures the hierarchical structure of viral evolution. With a weighted training framework to address substantial geographical and temporal imbalances in global sequence data, PETRA excels in predicting future SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide mutations and 17.10% for spike amino-acid mutations, compared to 0.49% and 6.64% respectively for the best baseline. PETRA also demonstrates its ability to aid in the real-time mutation prediction of major clades like 24F(XEC) and 25A(LP.8.1). The code is open sourced on https://github.com/xz-keg/PETra

[269] Structural Priors and Modular Adapters in the Composable Fine-Tuning Algorithm of Large-Scale Models

Yuxiao Wang, Di Wu, Feng Liu, Zhimin Qiu, Chenrui Hu

Main category: cs.LG

TL;DR: A composable fine-tuning method using graph structural priors and modular adapters to improve computational efficiency and stability in multi-task adaptation for large pre-trained models.

DetailsMotivation: To address high computational costs and structural instability in multi-task adaptation of large-scale pre-trained models by leveraging structural priors for better parameter efficiency and training stability.

Method: Integrates graph structural priors via relation matrices to model task dependencies, with modular adapters embedded through low-rank mapping and pluggable mechanisms for cross-task composition and reuse.

Result: Significantly enhances task prediction accuracy, adapter weight allocation precision, and computational efficiency while maintaining model lightweight design.

Conclusion: The framework demonstrates synergistic advantages of graph priors and modular mechanisms in composable fine-tuning, with verified consistency and superior performance under structural constraints.

Abstract: This paper proposes a composable fine-tuning method that integrates graph structural priors with modular adapters to address the high computational cost and structural instability faced by large-scale pre-trained models in multi-task adaptation. The method introduces a relation matrix to model dependencies among tasks, explicitly encoding correlations between nodes and paths into graph structural priors, which provide unified structural constraints for adapter weight allocation and path selection. Modular adapters are embedded into different layers through low-rank mapping and a pluggable mechanism, enabling efficient cross-task composition and reuse under prior guidance. This mechanism not only improves parameter efficiency and training stability but also alleviates path conflicts and redundant computation in multi-task scenarios. Furthermore, experiments on hyperparameter sensitivity, environmental sensitivity, and data sensitivity are conducted to systematically analyze key factors such as routing temperature, gating thresholds, and relation matrix regularization strength, verifying the consistency and superior performance of the method under structural constraints. The results demonstrate that the proposed framework significantly enhances task prediction accuracy, adapter weight allocation precision, and overall computational efficiency while maintaining model lightweight design, highlighting the synergistic advantages of graph priors and modular mechanisms in composable fine-tuning.
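A hypothetical sketch of relation-matrix-weighted composition of low-rank adapters (the dimensions, the relation matrix R, and the additive composition rule are our illustrative choices, not the paper's exact mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 32, 4, 3  # hidden dim, adapter rank, number of tasks (illustrative)

# One pluggable low-rank adapter (A_i, B_i) per task; its update is B_i @ A_i.
adapters = [(rng.normal(0, 0.02, (r, d)), rng.normal(0, 0.02, (d, r)))
            for _ in range(T)]

# Relation matrix encoding task dependencies (row-normalized graph prior).
R = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def composed_delta(task):
    """Compose adapter updates for a task, weighted by its graph-prior relations."""
    delta = np.zeros((d, d))
    for j, (A, B) in enumerate(adapters):
        delta += R[task, j] * (B @ A)
    return delta

W = rng.normal(size=(d, d))        # frozen pretrained weight
W_task0 = W + composed_delta(0)    # task-specific composition; W stays untouched
```

Because each task's update is a weighted sum of shared rank-r factors, adapters are reused across related tasks while the base weights remain frozen, which is where the parameter efficiency comes from.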

[270] TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis

Main category: cs.LG

TL;DR: TwIST is a distributed training framework that identifies high-quality sparse subnetworks during training, enabling zero-cost pruning at deployment with competitive performance to post-training methods.

DetailsMotivation: To enable efficient sparse LLM deployment without the overhead of post-training procedures like calibration or Hessian-based recovery, and to produce structured sparsity that works well on commodity hardware.

Method: Trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training to identify “golden tickets” - high-quality sparse subnetworks.

Result: Achieves competitive perplexity (23.14 PPL) compared to state-of-the-art post-training methods, with significant improvements under aggressive sparsity (50%+), outperforming closest prior approach (31.64 PPL).

Conclusion: TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead, producing structured dense matrices that offer practical inference speedups on commodity hardware.

Abstract: We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks (“golden tickets”) without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
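The sample-train-aggregate loop can be sketched with random masks and coverage-weighted averaging (the local training step is a stand-in, and the mask distribution and aggregation rule are our assumptions about one plausible instantiation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask(shape, sparsity):
    """Random binary mask keeping a (1 - sparsity) fraction of weights."""
    return (rng.random(shape) >= sparsity).astype(np.float64)

def aggregate(weights, masks):
    """Average each weight over the subnetworks in which it was active."""
    num = sum(w * m for w, m in zip(weights, masks))
    den = sum(masks)
    return np.where(den > 0, num / np.maximum(den, 1.0), 0.0)

shape, sparsity, workers = (8, 8), 0.5, 4
masks = [sample_mask(shape, sparsity) for _ in range(workers)]
# Stand-in for a local training round: each worker updates only its subnetwork.
weights = [m * rng.normal(size=shape) for m in masks]
merged = aggregate(weights, masks)  # periodic aggregation; then resample masks
```

Repeating this cycle is what surfaces subnetworks that train well in isolation, so at deployment one simply keeps a single mask with no calibration or recovery pass.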

[271] Use of Continuous Glucose Monitoring with Machine Learning to Identify Metabolic Subphenotypes and Inform Precision Lifestyle Changes

Ahmed A. Metwally, Heyjun Park, Yue Wu, Tracey McLaughlin, Michael P. Snyder

Main category: cs.LG

TL;DR: CGM and wearables enable dynamic metabolic phenotyping to identify distinct dysglycemia subtypes beyond traditional diabetes classification, allowing for personalized prevention strategies.

DetailsMotivation: Traditional static glucose thresholds obscure pathophysiological heterogeneity in dysglycemia driven by insulin resistance, beta-cell dysfunction, and incretin deficiency.

Method: Use continuous glucose monitoring and wearable technologies with machine learning models to analyze high-resolution glucose data from at-home oral glucose tolerance tests and real-world nutrition responses.

Result: Machine learning can accurately predict gold-standard measures of muscle IR and beta-cell function, identify unique postprandial glycemic responses as metabolic biomarkers, and reveal phenotype-specific lifestyle associations.

Conclusion: CGM enables deconstruction of early dysglycemia into actionable subphenotypes, paving the way for targeted nutritional, behavioral, and pharmacological strategies tailored to individual metabolic defects for precision diabetes prevention.

Abstract: The classification of diabetes and prediabetes by static glucose thresholds obscures the pathophysiological dysglycemia heterogeneity, primarily driven by insulin resistance (IR), beta-cell dysfunction, and incretin deficiency. This review demonstrates that continuous glucose monitoring and wearable technologies enable a paradigm shift towards non-invasive, dynamic metabolic phenotyping. We show evidence that machine learning models can leverage high-resolution glucose data from at-home, CGM-enabled oral glucose tolerance tests to accurately predict gold-standard measures of muscle IR and beta-cell function. This personalized characterization extends to real-world nutrition, where an individual’s unique postprandial glycemic response (PPGR) to standardized meals, such as the relative glucose spike to potatoes versus grapes, could serve as a biomarker for their metabolic subtype. Moreover, integrating wearable data reveals that habitual diet, sleep, and physical activity patterns, particularly their timing, are uniquely associated with specific metabolic dysfunctions, informing precision lifestyle interventions. The efficacy of dietary mitigators in attenuating PPGR is also shown to be phenotype-dependent. Collectively, this evidence demonstrates that CGM can deconstruct the complexity of early dysglycemia into distinct, actionable subphenotypes. This approach moves beyond simple glycemic control, paving the way for targeted nutritional, behavioral, and pharmacological strategies tailored to an individual’s core metabolic defects, thereby paving the way for a new era of precision diabetes prevention.

[272] Multiscale Astrocyte Network Calcium Dynamics for Biologically Plausible Intelligence in Anomaly Detection

Berk Iskar, Michael Taynnan Barros

Main category: cs.LG

TL;DR: A Ca²⁺-modulated learning framework inspired by astrocytic Ca²⁺ signaling in the brain improves network anomaly detection by enabling rapid, context-sensitive adaptation to concept drift and new threats.

DetailsMotivation: Traditional offline-trained network anomaly detectors are vulnerable to concept drift and new threats like zero-day attacks, requiring adaptive solutions that can handle evolving data patterns.

Method: Couples a multicellular astrocyte dynamics simulator (modeling IP₃-mediated Ca²⁺ release, SERCA pump uptake, and gap junction diffusion) with a deep neural network to create a biologically plausible adaptive learning framework.

Result: Outperforms baseline DNN with ~98% accuracy, reduced false positives/negatives across multiple train/test splits on CTU-13 network traffic data, with negligible runtime overhead after Ca²⁺ trajectory precomputation.

Conclusion: The Ca²⁺-modulated framework provides a generic solution for streaming detection tasks requiring rapid, biologically grounded adaptation to evolving data patterns, demonstrated effectively for cybersecurity applications.

Abstract: Network anomaly detection systems encounter several challenges with traditional detectors trained offline. They become susceptible to concept drift and new threats such as zero-day or polymorphic attacks. To address this limitation, we propose a Ca$^{2+}$-modulated learning framework that draws inspiration from astrocytic Ca$^{2+}$ signaling in the brain, where rapid, context-sensitive adaptation enables robust information processing. Our approach couples a multicellular astrocyte dynamics simulator with a deep neural network (DNN). The simulator models astrocytic Ca$^{2+}$ dynamics through three key mechanisms: IP$_3$-mediated Ca$^{2+}$ release, SERCA pump uptake, and conductance-aware diffusion through gap junctions between cells. Evaluation of our proposed network on CTU-13 (Neris) network traffic data demonstrates the effectiveness of our biologically plausible approach. The Ca$^{2+}$-gated model outperforms a matched baseline DNN, achieving up to $\sim$98% accuracy with reduced false positives and negatives across multiple train/test splits. Importantly, this improved performance comes with negligible runtime overhead once Ca$^{2+}$ trajectories are precomputed. While demonstrated here for cybersecurity applications, this Ca$^{2+}$-modulated learning framework offers a generic solution for streaming detection tasks that require rapid, biologically grounded adaptation to evolving data patterns.
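The gating idea can be sketched with a toy simulator: a few coupled cells exchange Ca²⁺ through IP₃-driven release, SERCA-like uptake, and nearest-neighbour diffusion, and the resulting (precomputed) Ca²⁺ level multiplicatively gates a layer's activations. The ODE terms and the gain function below are simplified stand-ins, not the paper's simulator or its exact DNN coupling.

```python
import numpy as np

def simulate_ca(n_cells=4, steps=500, dt=0.01, seed=0):
    """Toy astrocyte Ca2+ dynamics: IP3-driven release, SERCA-like
    uptake, and diffusion between neighbouring cells (Euler steps).
    A simplified stand-in for the paper's multicellular simulator."""
    rng = np.random.default_rng(seed)
    ca = np.full(n_cells, 0.1)             # cytosolic Ca2+ per cell
    ip3 = rng.uniform(0.5, 1.5, n_cells)   # fixed IP3 drive per cell
    traj = np.empty((steps, n_cells))
    for t in range(steps):
        release = 0.8 * ip3 / (1.0 + ip3)              # IP3-mediated release
        uptake = 0.5 * ca**2 / (0.3**2 + ca**2)        # SERCA-like pump uptake
        diff = 0.2 * (np.roll(ca, 1) + np.roll(ca, -1) - 2 * ca)  # gap junctions
        ca = np.clip(ca + dt * (release - uptake + diff), 0.0, None)
        traj[t] = ca
    return traj

def ca_gate(activations, ca_level):
    """Multiplicatively gate a layer's activations by a Ca2+ level
    squashed into (0, 1); the gain form is an assumption."""
    gain = ca_level / (1.0 + ca_level)
    return activations * gain

traj = simulate_ca()          # precomputed trajectories, as in the paper
h = np.ones(8)                # stand-in hidden activations
gated = ca_gate(h, traj[-1].mean())
print(gated.shape)
```

Because the trajectories are simulated once up front, the per-step cost at inference is a single multiply, which is consistent with the reported negligible runtime overhead.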

[273] Towards Scalable Meta-Learning of near-optimal Interpretable Models via Synthetic Model Generations

Kyaw Hpone Myint, Zhe Wu, Alexandre G. R. Day, Giri Iyengar

Main category: cs.LG

TL;DR: Efficient synthetic data generation method for meta-learning decision trees that matches real-world data performance while reducing computational costs.

DetailsMotivation: Decision trees are crucial in high-stakes domains due to interpretability, but meta-learning them requires large datasets which are expensive to obtain with optimal trees.

Method: Sample near-optimal decision trees synthetically to create large-scale datasets, then use MetaTree transformer architecture for meta-learning.

Result: Achieves performance comparable to pre-training on real-world data or with computationally expensive optimal decision trees.

Conclusion: This approach significantly reduces computational costs, enhances data generation flexibility, and enables scalable meta-learning of interpretable decision tree models.

Abstract: Decision trees are widely used in high-stakes fields like finance and healthcare due to their interpretability. This work introduces an efficient, scalable method for generating synthetic pre-training data to enable meta-learning of decision trees. Our approach samples near-optimal decision trees synthetically, creating large-scale, realistic datasets. Using the MetaTree transformer architecture, we demonstrate that this method achieves performance comparable to pre-training on real-world data or with computationally expensive optimal decision trees. This strategy significantly reduces computational costs, enhances data generation flexibility, and paves the way for scalable and efficient meta-learning of interpretable decision tree models.
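The synthetic-generation step can be illustrated by sampling random axis-aligned trees and labelling random inputs with them, yielding unlimited (X, y) tasks for pre-training. The paper samples near-optimal trees, so the uniform sampler below is only a hypothetical stand-in for that procedure.

```python
import numpy as np

def sample_random_tree(depth, n_features, rng):
    """Sample a random axis-aligned decision tree as a nested dict.
    A stand-in for the paper's near-optimal tree sampler."""
    if depth == 0:
        return {"leaf": int(rng.integers(2))}
    return {
        "feature": int(rng.integers(n_features)),
        "threshold": float(rng.uniform(0, 1)),
        "left": sample_random_tree(depth - 1, n_features, rng),
        "right": sample_random_tree(depth - 1, n_features, rng),
    }

def tree_predict(tree, x):
    """Route a point down the tree to its leaf label."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

def synth_dataset(n_samples=256, n_features=5, depth=3, seed=0):
    """Generate one (X, y) pre-training task labelled by a fresh tree."""
    rng = np.random.default_rng(seed)
    tree = sample_random_tree(depth, n_features, rng)
    X = rng.uniform(0, 1, (n_samples, n_features))
    y = np.array([tree_predict(tree, x) for x in X])
    return X, y

X, y = synth_dataset()
print(X.shape, len(y))
```

Each sampled tree defines a ground-truth labelling function, so a meta-learner such as MetaTree can be trained to recover tree structure from (X, y) pairs at whatever scale the sampler is run.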

[274] Accelerating scientific discovery with the common task framework

J. Nathan Kutz, Peter Battaglia, Michael Brenner, Kevin Carlberg, Aric Hagberg, Shirley Ho, Stephan Hoyer, Henning Lange, Hod Lipson, Michael W. Mahoney, Frank Noe, Max Welling, Laure Zanna, Francis Zhu, Steven L. Brunton

Main category: cs.LG

TL;DR: A common task framework (CTF) is proposed to evaluate ML/AI algorithms in science and engineering, providing standardized metrics and challenge datasets for diverse objectives like forecasting, state reconstruction, and control.

DetailsMotivation: There is a critical need for objective metrics to compare diverse ML/AI algorithms being rapidly developed and deployed across science and engineering domains, similar to frameworks that advanced ML in traditional applications.

Method: Introduce a common task framework featuring growing collection of challenge datasets with diverse practical objectives, considering limited data scenarios and noisy measurements.

Result: The CTF provides standardized comparative metrics for evaluating ML/AI algorithms across multiple scientific objectives including forecasting, state reconstruction, generalization, and control.

Conclusion: A common task framework is essential for advancing ML/AI applications in science and engineering by enabling objective comparison of algorithms and accelerating progress through standardized evaluation metrics.

Abstract: Machine learning (ML) and artificial intelligence (AI) algorithms are transforming and empowering the characterization and control of dynamic systems in the engineering, physical, and biological sciences. These emerging modeling paradigms require comparative metrics to evaluate a diverse set of scientific objectives, including forecasting, state reconstruction, generalization, and control, while also considering limited data scenarios and noisy measurements. We introduce a common task framework (CTF) for science and engineering, which features a growing collection of challenge data sets with a diverse set of practical and common objectives. The CTF is a critically enabling technology that has contributed to the rapid advance of ML/AI algorithms in traditional applications such as speech recognition, language processing, and computer vision. There is a critical need for the objective metrics of a CTF to compare the diverse algorithms being rapidly developed and deployed in practice today across science and engineering.

[275] Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

Main category: cs.LG

TL;DR: This paper introduces an autoregressive-aware split computing framework for deploying large language models on edge devices, addressing memory and communication bottlenecks through strategic partitioning, compression, and optimization.

DetailsMotivation: LLMs achieve near-human reasoning but are impractical for IoT devices due to massive parameters and memory-intensive decoding. Split computing between edge and cloud is promising but existing approaches fail to address autoregressive inference challenges like iterative token generation and KV cache requirements.

Method: Three key contributions: 1) One-point split compression (OPSC) with mixed-precision quantization to prevent out-of-memory failures, 2) Two-stage intermediate compression pipeline combining threshold splitting and token-wise adaptive bit quantization to reduce communication overhead, 3) Unified optimization framework for selecting optimal split points, quantization settings, and sequence lengths.

Result: Extensive evaluations show superior performance over state-of-the-art quantization methods (SmoothQuant, OmniQuant, Atom), achieving 1.49× inference speedup and significant communication overhead reduction while maintaining or improving model accuracy across diverse LLMs and hardware platforms.

Conclusion: The proposed framework successfully enables practical LLM deployment on resource-constrained edge devices through autoregressive-aware split computing, effectively balancing memory, latency, and accuracy constraints.

Abstract: Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and expanding key-value (KV) cache requirements. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by strategically partitioning models into front-end and back-end segments with different precision levels. Second, we propose a two-stage intermediate compression pipeline that combines threshold splitting (TS) and token-wise adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations while dramatically reducing communication overhead. Third, we formulate a unified optimization framework that jointly selects optimal split points, quantization settings, and sequence lengths to satisfy strict memory and latency constraints. Extensive evaluations across diverse LLMs and hardware platforms demonstrate superior performance compared to state-of-the-art quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework achieves a 1.49$\times$ inference speedup and significant communication overhead reduction while maintaining or improving model accuracy.
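The TAB-Q idea (spend more bits on accuracy-critical tokens before transmitting intermediate activations) might look roughly like the sketch below; the percentile rule for deciding which tokens keep higher precision is an assumption, not the paper's exact criterion.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of a vector to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    return np.round(x / scale) * scale

def tabq(activations, low_bits=4, high_bits=8, pct=90):
    """Token-wise adaptive bit quantization, sketched: tokens whose
    activation range exceeds a percentile threshold keep more bits.
    The threshold rule is assumed, not the paper's criterion."""
    ranges = np.ptp(activations, axis=1)   # per-token dynamic range
    cut = np.percentile(ranges, pct)
    out = np.empty_like(activations)
    for i, row in enumerate(activations):
        bits = high_bits if ranges[i] >= cut else low_bits
        out[i] = quantize(row, bits)
    return out

acts = np.random.default_rng(0).normal(size=(16, 64))  # 16 tokens, 64 dims
q = tabq(acts)
print(q.shape)
```

Sending mostly 4-bit tokens with a handful of 8-bit ones is what makes the edge-to-cloud payload small while the critical activations stay near full fidelity.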

[276] Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training

Xiaoling Luo, Peng Chen, Chengliang Liu, Xiaopeng Jin, Jie Wen, Yumeng Liu, Junsong Wang

Main category: cs.LG

TL;DR: DSRPGO is a multimodal protein function prediction method that uses dynamic selection and reconstructive pre-training to handle complex protein features and hierarchical multi-label classification.

DetailsMotivation: Multimodal protein features contain diverse information but their complex interconnections make protein function prediction challenging.

Method: Uses reconstructive pre-training for fine-grained information, Bidirectional Interaction Module for multimodal feature learning, and Dynamic Selection Module for optimal feature representation in hierarchical classification.

Result: Significantly improves performance in BPO, MFO, and CCO on human datasets compared to benchmark models.

Conclusion: DSRPGO effectively handles multimodal protein features and hierarchical classification for improved protein function prediction.

Abstract: Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre-training mechanisms. To acquire complex protein information, we introduce reconstructive pre-training to mine more fine-grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi-label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction. Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models.

[277] DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, Jian Cheng

Main category: cs.LG

TL;DR: DartQuant is an efficient distribution-aware rotational calibration method that reduces computational costs and overfitting in model quantization by constraining activation distributions after rotation and using QR-Orth optimization.

DetailsMotivation: End-to-end fine-tuning of rotational optimization algorithms for quantization incurs high computational costs and is prone to overfitting, making it challenging for large-scale models.

Method: Proposes DartQuant with distribution-aware rotational calibration that constrains activation distributions after rotation, and introduces QR-Orth optimization scheme to replace expensive alternating optimization.

Result: Achieves 47× acceleration and 10× memory savings for rotational optimization on a 70B model, and successfully completes rotational calibration for a 70B model on a single 3090 GPU.

Conclusion: DartQuant makes quantization of large language models feasible in resource-constrained environments with superior performance across various model quantization experiments.

Abstract: Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates superior performance. Compared to existing methods, it achieves 47$\times$ acceleration and 10$\times$ memory savings for rotational optimization on a 70B model. Furthermore, it is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments. Code is available at https://github.com/CAS-CLab/DartQuant.git.

[278] Pediatric Appendicitis Detection from Ultrasound Images

Fatemeh Hosseinabadi, Seyedhassan Sharifi

Main category: cs.LG

TL;DR: Deep learning model using pretrained ResNet achieves 93.44% accuracy in automated detection of pediatric appendicitis from ultrasound images.

DetailsMotivation: Pediatric appendicitis diagnosis is challenging due to overlapping symptoms and variable imaging quality, requiring automated tools to assist clinicians.

Method: Fine-tuned ResNet architecture on Regensburg Pediatric Appendicitis Dataset with ultrasound images, using preprocessing (normalization, resizing, augmentation) for classification between appendicitis and non-appendicitis cases.

Result: Model achieved 93.44% accuracy, 91.53% precision, and 89.8% recall, demonstrating strong performance across heterogeneous ultrasound views despite challenges like low contrast and speckle noise.

Conclusion: The ResNet-based deep learning model effectively identifies pediatric appendicitis from ultrasound images, learning discriminative spatial features that overcome imaging challenges in pediatric cases.

Abstract: Pediatric appendicitis remains one of the most common causes of acute abdominal pain in children, and its diagnosis continues to challenge clinicians due to overlapping symptoms and variable imaging quality. This study aims to develop and evaluate a deep learning model based on a pretrained ResNet architecture for automated detection of appendicitis from ultrasound images. We used the Regensburg Pediatric Appendicitis Dataset, which includes ultrasound scans, laboratory data, and clinical scores from pediatric patients admitted with abdominal pain to the Children's Hospital St. Hedwig in Regensburg, Germany. Each subject had 1 to 15 ultrasound views covering the right lower quadrant, appendix, lymph nodes, and related structures. For the image-based classification task, ResNet was fine-tuned to distinguish appendicitis from non-appendicitis cases. Images were preprocessed by normalization, resizing, and augmentation to enhance generalization. The proposed ResNet model achieved an overall accuracy of 93.44%, precision of 91.53%, and recall of 89.8%, demonstrating strong performance in identifying appendicitis across heterogeneous ultrasound views. The model effectively learned discriminative spatial features, overcoming challenges posed by low contrast, speckle noise, and anatomical variability in pediatric imaging.
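The preprocessing steps (normalization, resizing, augmentation) can be sketched in plain NumPy; the nearest-neighbour resize, target size, and jitter ranges below are assumptions, not details taken from the paper's pipeline.

```python
import numpy as np

def preprocess(img, size=224):
    """Normalize to [0, 1] and nearest-neighbour resize; a minimal
    stand-in for the paper's normalization and resizing steps."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo + 1e-8)
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]

def augment(img, rng):
    """Random horizontal flip plus mild brightness jitter (assumed ops)."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    return np.clip(img * rng.uniform(0.9, 1.1), 0.0, 1.0)

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, (480, 640)).astype(np.uint8)  # stand-in scan
x = augment(preprocess(raw), rng)
print(x.shape)
```

The augmented tensor would then be fed to the fine-tuned ResNet; augmentation at train time is what lets the model cope with the heterogeneous views per subject.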

[279] Left Atrial Segmentation with nnU-Net Using MRI

Fatemeh Hosseinabadi, Seyedhassan Sharifi

Main category: cs.LG

TL;DR: Applied nnU-Net framework to automatically segment left atrium from cardiac MRI, achieving high accuracy (93.5 Dice score) and outperforming traditional methods.

DetailsMotivation: Manual LA segmentation is time-consuming, observer-dependent, and impractical for clinical workflows. Deep learning can provide automated, accurate segmentation for AF ablation guidance and cardiac modeling.

Method: Used nnU-Net framework on Left Atrial Segmentation Challenge 2013 dataset (30 MRI scans with expert annotations). The model automatically adapted preprocessing, network configuration, and training pipeline to MRI data characteristics.

Result: Achieved mean Dice score of 93.5, showing high overlap with expert annotations and outperforming previous traditional segmentation approaches. Demonstrated robust generalization across variations in LA shape, contrast, and image quality.

Conclusion: nnU-Net provides accurate and automated LA segmentation from cardiac MRI, suitable for clinical applications in AF ablation and cardiac modeling workflows.

Abstract: Accurate segmentation of the left atrium (LA) from cardiac MRI is critical for guiding atrial fibrillation (AF) ablation and constructing biophysical cardiac models. Manual delineation is time-consuming, observer-dependent, and impractical for large-scale or time-sensitive clinical workflows. Deep learning methods, particularly convolutional architectures, have recently demonstrated superior performance in medical image segmentation tasks. In this study, we applied the nnU-Net framework, an automated, self-configuring deep learning segmentation architecture, to the Left Atrial Segmentation Challenge 2013 dataset. The dataset consists of thirty MRI scans with corresponding expert-annotated masks. The nnU-Net model automatically adapted its preprocessing, network configuration, and training pipeline to the characteristics of the MRI data. Model performance was quantitatively evaluated using the Dice similarity coefficient (DSC), and qualitative results were compared against expert segmentations. The proposed nnU-Net model achieved a mean Dice score of 93.5, demonstrating high overlap with expert annotations and outperforming several traditional segmentation approaches reported in previous studies. The network exhibited robust generalization across variations in left atrial shape, contrast, and image quality, accurately delineating both the atrial body and proximal pulmonary veins.
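The Dice similarity coefficient used to score the segmentations is a standard overlap metric and easy to reproduce:

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice similarity coefficient between two binary masks:
    2|A∩B| / (|A| + |B|), the metric used to evaluate the model."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

# Two 16-pixel squares offset by one pixel: overlap is 3x3 = 9 pixels,
# so Dice = 2*9 / (16+16) = 0.5625.
a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[3:7, 3:7] = True
print(round(dice(a, b), 4))   # -> 0.5625
```

A reported score of 93.5 corresponds to a Dice of 0.935 on this scale, i.e. very high agreement with the expert masks.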

[280] Learning Filter-Aware Distance Metrics for Nearest Neighbor Search with Multiple Filters

Ananya Sutradhar, Suryansh Gupta, Ravishankar Krishnaswamy, Haiyang Xu, Aseem Rastogi, Gopal Srinivasa

Main category: cs.LG

TL;DR: A learned approach for filtered approximate nearest neighbor search that adapts distance functions to data distributions, outperforming fixed-penalty methods by 5-10% in accuracy.

DetailsMotivation: Existing graph-based filtered ANN methods use fixed penalties for filter constraints, which fail to generalize across datasets with diverse label and vector distributions.

Method: Formulate filtered ANN as a constrained linear optimization problem to learn optimal trade-off weights between vector distance and filter match directly from data, guiding both search and index construction.

Result: Experiments show 5-10% accuracy improvement over fixed-penalty methods by better capturing filter distribution and semantics.

Conclusion: Learning data-adaptive distance functions provides a more flexible and generalizable framework for filtered ANN search compared to fixed-penalty approaches.

Abstract: Filtered Approximate Nearest Neighbor (ANN) search retrieves the closest vectors for a query vector from a dataset. It enforces that a specified set of discrete labels $S$ for the query must be included in the labels of each retrieved vector. Existing graph-based methods typically incorporate filter awareness by assigning fixed penalties or prioritizing nodes based on filter satisfaction. However, since these methods use fixed, data-independent penalties, they often fail to generalize across datasets with diverse label and vector distributions. In this work, we propose a principled alternative that learns the optimal trade-off between vector distance and filter match directly from the data, rather than relying on fixed penalties. We formulate this as a constrained linear optimization problem, deriving weights that better reflect the underlying filter distribution and more effectively address the filtered ANN search problem. These learned weights guide both the search process and index construction, leading to graph structures that more effectively capture the underlying filter distribution and filter semantics. Our experiments demonstrate that adapting the distance function to the data significantly improves accuracy by 5-10% over fixed-penalty methods, providing a more flexible and generalizable framework for the filtered ANN search problem.
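At search time the learned trade-off reduces to a combined score: vector distance plus a weight per unmatched filter label. The sketch below fixes the weight `w` rather than learning it from data via constrained linear optimization as the paper does, and uses brute-force search instead of a graph index.

```python
import numpy as np

def filter_penalty(query_labels, item_labels):
    """Number of required query labels missing from the item."""
    return len(set(query_labels) - set(item_labels))

def filtered_score(q_vec, item_vec, q_labels, item_labels, w):
    """Combined score: Euclidean distance plus a per-mismatch weight w
    (the paper learns w from the data; here it is simply given)."""
    return np.linalg.norm(q_vec - item_vec) + w * filter_penalty(q_labels, item_labels)

def search(q_vec, q_labels, items, w, k=2):
    """Brute-force stand-in for graph search under the combined score."""
    scored = sorted(items, key=lambda it: filtered_score(q_vec, it[0], q_labels, it[1], w))
    return scored[:k]

items = [
    (np.array([0.0, 0.0]), {"red"}),          # nearest, but misses "big"
    (np.array([0.1, 0.0]), {"red", "big"}),   # slightly farther, full match
    (np.array([2.0, 0.0]), {"red", "big"}),
]
top = search(np.zeros(2), {"red", "big"}, items, w=1.0, k=1)
print(top[0][1])   # labels of the winner
```

With `w=1.0` the full-match item at distance 0.1 beats the closer partial match (score 0.1 vs 1.0), which is exactly the behaviour a data-adaptive weight is meant to tune per dataset.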

[281] DeNoise: Learning Robust Graph Representations for Unsupervised Graph-Level Anomaly Detection

Qingfeng Chen, Haojin Zeng, Jingyi Jie, Shichao Zhang, Debo Cheng

Main category: cs.LG

TL;DR: DeNoise is a robust unsupervised graph-level anomaly detection framework that handles contaminated training data by learning noise-resistant embeddings through adversarial optimization and contrastive learning.

DetailsMotivation: Most Graph Neural Network approaches assume clean training data with only normal graphs, but real-world datasets often contain anomalous graphs that distort learned representations and degrade performance.

Method: Jointly optimizes graph-level encoder, attribute decoder, and structure decoder via adversarial objective; uses encoder anchor-alignment denoising mechanism to fuse node embeddings from normal graphs; employs contrastive learning to compact normal embeddings and repel anomalous ones.

Result: Extensive experiments on eight real-world datasets show DeNoise consistently learns reliable graph-level representations under varying noise intensities and significantly outperforms state-of-the-art UGAD baselines.

Conclusion: DeNoise effectively addresses the challenge of contaminated training data in unsupervised graph-level anomaly detection through its robust framework design.

Abstract: With the rapid growth of graph-structured data in critical domains, unsupervised graph-level anomaly detection (UGAD) has become a pivotal task. UGAD seeks to identify entire graphs that deviate from normal behavioral patterns. However, most Graph Neural Network (GNN) approaches implicitly assume that the training set is clean, containing only normal graphs, which is rarely true in practice. Even modest contamination by anomalous graphs can distort learned representations and sharply degrade performance. To address this challenge, we propose DeNoise, a robust UGAD framework explicitly designed for contaminated training data. It jointly optimizes a graph-level encoder, an attribute decoder, and a structure decoder via an adversarial objective to learn noise-resistant embeddings. Further, DeNoise introduces an encoder anchor-alignment denoising mechanism that fuses high-information node embeddings from normal graphs into all graph embeddings, improving representation quality while suppressing anomaly interference. A contrastive learning component then compacts normal graph embeddings and repels anomalous ones in the latent space. Extensive experiments on eight real-world datasets demonstrate that DeNoise consistently learns reliable graph-level representations under varying noise intensities and significantly outperforms state-of-the-art UGAD baselines.
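The contrastive component (compact normal embeddings, repel anomalous ones) can be sketched as a centroid-based pull/push loss; the specific hinge form below is an assumption for illustration, not the paper's exact objective.

```python
import numpy as np

def contrastive_loss(normal, anomalous, margin=1.0):
    """Pull normal graph embeddings toward their centroid and push
    anomalous ones at least `margin` away; a simplified stand-in for
    DeNoise's contrastive learning component."""
    center = normal.mean(axis=0)
    pull = np.mean(np.sum((normal - center) ** 2, axis=1))        # compactness
    d_anom = np.linalg.norm(anomalous - center, axis=1)
    push = np.mean(np.maximum(0.0, margin - d_anom) ** 2)          # repulsion
    return pull + push

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, (32, 8))   # well-clustered normal embeddings
loose = rng.normal(0.0, 1.0, (32, 8))   # poorly clustered embeddings
anom = rng.normal(2.0, 0.1, (8, 8))     # anomalous embeddings, far away
print(contrastive_loss(tight, anom) < contrastive_loss(loose, anom))
```

Minimizing such a loss drives exactly the latent geometry the abstract describes: normal graphs collapse into a compact region while anomalies are held outside the margin.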

[282] KoTaP: A Panel Dataset for Corporate Tax Avoidance, Performance, and Governance in Korea

Hyungjong Na, Wonho Song, Seungyong Han, Donghyeon Jo, Sejin Myung, Hyungjoon Kim

Main category: cs.LG

TL;DR: Introduces KoTaP, a Korean Tax Avoidance Panel dataset covering 2011-2024 with 12,653 firm-year observations from 1,754 non-financial firms, designed to study tax avoidance as a predictor variable across multiple business domains.

DetailsMotivation: To create a standardized, long-term panel dataset for studying corporate tax avoidance in Korean firms that provides both international comparability and reflects unique Korean institutional features like concentrated ownership and high foreign shareholding.

Method: Constructed a balanced panel dataset from KOSPI and KOSDAQ listed firms (2011-2024), excluding financial firms, non-December fiscal years, capital impairment, and negative pre-tax income. Tax avoidance measured using CETR, GETR, TSTA, and TSDA indicators with adjustments for interpretability.

Result: Created KoTaP dataset with standardized variables covering earnings management, profitability, stability, growth, and governance domains. The dataset shows consistency with international literature while capturing distinctive Korean institutional characteristics.

Conclusion: KoTaP serves as a critical open resource enabling applications in econometric modeling, deep learning, policy evaluation, audit planning, and investment analysis for accounting, finance, and interdisciplinary research.

Abstract: This study introduces the Korean Tax Avoidance Panel (KoTaP), a long-term panel dataset of non-financial firms listed on KOSPI and KOSDAQ between 2011 and 2024. After excluding financial firms, firms with non-December fiscal year ends, capital impairment, and negative pre-tax income, the final dataset consists of 12,653 firm-year observations from 1,754 firms. KoTaP is designed to treat corporate tax avoidance as a predictor variable and link it to multiple domains, including earnings management (accrual- and activity-based), profitability (ROA, ROE, CFO, LOSS), stability (LEV, CUR, SIZE, PPE, AGE, INVREC), growth (GRW, MB, TQ), and governance (BIG4, FORN, OWN). Tax avoidance itself is measured using complementary indicators: the cash effective tax rate (CETR), the GAAP effective tax rate (GETR), and book-tax difference measures (TSTA, TSDA), with adjustments to ensure interpretability. A key strength of KoTaP is its balanced panel structure with standardized variables and its consistency with international literature on the distribution and correlation of core indicators. At the same time, it reflects distinctive institutional features of Korean firms, such as concentrated ownership, high foreign shareholding, and elevated liquidity ratios, providing both international comparability and contextual uniqueness. KoTaP enables applications in benchmarking econometric and deep learning models, external validity checks, and explainable AI analyses. It further supports policy evaluation, audit planning, and investment analysis, making it a critical open resource for accounting, finance, and interdisciplinary research.
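The headline tax-avoidance indicators are standard ratios. A minimal sketch follows, with an assumed clipping adjustment to [0, 1] for interpretability; KoTaP's exact adjustments may differ.

```python
def cetr(cash_taxes_paid, pretax_income):
    """Cash effective tax rate: cash taxes paid / pre-tax income,
    clipped to [0, 1] (the clipping rule is an assumed adjustment)."""
    return min(max(cash_taxes_paid / pretax_income, 0.0), 1.0)

def getr(tax_expense, pretax_income):
    """GAAP effective tax rate: total tax expense / pre-tax income,
    clipped to [0, 1] (same assumed adjustment)."""
    return min(max(tax_expense / pretax_income, 0.0), 1.0)

# Toy firm-year: 1,000 pre-tax income, 180 cash taxes, 220 GAAP tax expense.
print(cetr(180.0, 1000.0), getr(220.0, 1000.0))   # -> 0.18 0.22
```

Lower effective tax rates relative to peers are the usual proxy for greater tax avoidance, which is why the dataset excludes negative pre-tax income firm-years where these ratios are uninterpretable.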

[283] Decomposable Neuro Symbolic Regression

Giorgio Morales, John W. Sheppard

Main category: cs.LG

TL;DR: A decomposable symbolic regression method that generates interpretable multivariate expressions by combining transformers, genetic algorithms, and genetic programming to distill opaque regression models into mathematical explanations.

DetailsMotivation: Most symbolic regression methods prioritize prediction error minimization over identifying true governing equations, often producing overly complex or inaccurate expressions that lack interpretability.

Method: Uses Multi-Set Transformer to generate univariate symbolic skeletons, evaluates them with genetic algorithms, merges them via genetic programming while preserving structure, and optimizes coefficients with genetic algorithms.

Result: Achieved lower or comparable interpolation/extrapolation errors compared to GP-based methods, neural SR methods, and hybrid approaches, while consistently learning expressions matching the original mathematical structure.

Conclusion: The proposed decomposable SR method successfully generates interpretable multivariate expressions that accurately capture underlying mathematical relationships, outperforming existing approaches in both accuracy and structural fidelity.

Abstract: Symbolic regression (SR) models complex systems by discovering mathematical expressions that capture underlying relationships in observed data. However, most SR methods prioritize minimizing prediction error over identifying the governing equations, often producing overly complex or inaccurate expressions. To address this, we present a decomposable SR method that generates interpretable multivariate expressions leveraging transformer models, genetic algorithms (GAs), and genetic programming (GP). In particular, our explainable SR method distills a trained "opaque" regression model into mathematical expressions that serve as explanations of its computed function. Our method employs a Multi-Set Transformer to generate multiple univariate symbolic skeletons that characterize how each variable influences the opaque model’s response. We then evaluate the generated skeletons’ performance using a GA-based approach to select a subset of high-quality candidates before incrementally merging them via a GP-based cascade procedure that preserves their original skeleton structure. The final multivariate skeletons undergo coefficient optimization via a GA. We evaluated our method on problems with controlled and varying degrees of noise, demonstrating lower or comparable interpolation and extrapolation errors compared to two GP-based methods, three neural SR methods, and a hybrid approach. Unlike them, our approach consistently learned expressions that matched the original mathematical structure.

[284] Exploring the Feasibility of End-to-End Large Language Model as a Compiler

Hongbin Zhang, Shihao Gao, Yang Liu, Mingjie Xing, Yanjun Wu, Chen Zhao

Main category: cs.LG

TL;DR: This paper explores using Large Language Models (LLMs) as end-to-end compilers (LaaC), evaluating their capabilities in source code comprehension and assembly generation, and identifying methods to improve compilation success rates.

DetailsMotivation: While LLMs have been used to assist compiler development, their potential as complete end-to-end compilers remains largely unexplored. The authors aim to investigate whether LLMs can effectively replace traditional compilers.

Method: Created CompilerEval dataset and framework to evaluate mainstream LLMs’ compilation capabilities. Analyzed various error types, tested prompt optimization, model scaling, and reasoning methods to improve generated assembly code quality.

Result: LLMs demonstrate basic compilation capabilities but currently achieve low compilation success rates. Through optimized prompts, larger models, and reasoning methods, the quality of generated assembly code can be significantly improved.

Conclusion: The authors maintain an optimistic outlook for LaaC, proposing that with targeted training, knowledge-rich prompts, and specialized infrastructure, LLMs have the potential to generate high-quality assembly code and drive a paradigm shift in compilation technology.

Abstract: In recent years, end-to-end Large Language Model (LLM) technology has shown substantial advantages across various domains. As critical system software and infrastructure, compilers are responsible for transforming source code into target code. While LLMs have been leveraged to assist in compiler development and maintenance, their potential as an end-to-end compiler remains largely unexplored. This paper explores the feasibility of LLM as a Compiler (LaaC) and its future directions. We designed the CompilerEval dataset and framework specifically to evaluate the capabilities of mainstream LLMs in source code comprehension and assembly code generation. In the evaluation, we analyzed various errors, explored multiple methods to improve LLM-generated code, and evaluated cross-platform compilation capabilities. Experimental results demonstrate that LLMs exhibit basic capabilities as compilers but currently achieve low compilation success rates. By optimizing prompts, scaling up the model, and incorporating reasoning methods, the quality of assembly code generated by LLMs can be significantly enhanced. Based on these findings, we maintain an optimistic outlook for LaaC and propose practical architectural designs and future research directions. We believe that with targeted training, knowledge-rich prompts, and specialized infrastructure, LaaC has the potential to generate high-quality assembly code and drive a paradigm shift in the field of compilation.

[285] Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning

Jiaming Zhang, Yujie Yang, Haoning Wang, Liping Zhang, Shengbo Eben Li

Main category: cs.LG

TL;DR: EPO is a safe RL framework for semi-infinite constraint problems that iteratively solves subproblems with finite constraint sets and adaptively adjusts active constraints to achieve optimal performance with bounded safety violations.

Motivation: Safe RL often faces infinite constraints when safety conditions must be enforced across continuous parameter spaces (e.g., resource distribution at every location), requiring new approaches beyond traditional finite-constraint methods.

Method: Exchange Policy Optimization (EPO) iteratively solves safe RL subproblems with finite constraint sets, adaptively adding constraints with violations exceeding tolerance and removing those with zero Lagrange multipliers to prevent uncontrolled set growth.

Result: EPO achieves performance comparable to optimal solutions while ensuring global constraint violations remain strictly within a prescribed bound under mild assumptions.

Conclusion: EPO provides an effective algorithmic framework for semi-infinite safe RL problems, enabling optimal policy training with deterministic bounded safety guarantees through adaptive constraint management.

Abstract: Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints with violations exceeding the predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, strategies trained via EPO achieve performance comparable to optimal solutions with global constraint violations strictly remaining within a prescribed bound.
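The expansion/deletion rule at the heart of EPO can be sketched as a small update on the working constraint set (a minimal illustration with hypothetical names; the paper's subproblem solver is a full safe-RL policy update, stubbed out here):

```python
# Toy sketch of the EPO-style exchange rule. After each policy update we are
# assumed to observe, per constraint index: its violation and its Lagrange
# multiplier (both hypothetical inputs for this illustration).

def exchange_step(working_set, violations, multipliers, tol=1e-3):
    """One exchange-rule update on the active constraint set."""
    # Expansion: add constraints whose violation exceeds the tolerance.
    expanded = working_set | {i for i, v in violations.items() if v > tol}
    # Deletion: drop constraints whose Lagrange multiplier is zero after the
    # update (constraints with no multiplier info are kept by default).
    return {i for i in expanded if multipliers.get(i, 1.0) != 0.0}

working = {0, 1}
violations = {2: 0.05, 3: 1e-6}          # only constraint 2 exceeds tol
multipliers = {0: 0.0, 1: 0.7, 2: 0.3}   # constraint 0 became inactive
print(exchange_step(working, violations, multipliers))  # -> {1, 2}
```

Because additions require a real violation and deletions require a zero multiplier, the working set stays small instead of growing without bound.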

[286] Block Rotation is All You Need for MXFP4 Quantization

Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng

Main category: cs.LG

TL;DR: This paper establishes a comprehensive benchmark for post-training quantization (PTQ) methods using the new MXFP4 format, identifies incompatibility issues with rotation-based methods, analyzes the root cause, and proposes an effective block rotation solution.

Motivation: Large language models (LLMs) face prohibitive costs due to their scale, and while PTQ offers efficient deployment, achieving accurate W4A4 quantization remains challenging. The emergence of MXFP4 format with hardware support raises questions about current PTQ methods' applicability.

Method: The authors conduct systematic evaluation of PTQ methods under MXFP4 format, analyze the conflict between MXFP4’s power-of-two block scaling and rotation-based methods, and propose a simple block rotation strategy to adapt rotation-based methods to MXFP4.

Result: GPTQ consistently performs well with MXFP4, but rotation-based methods suffer severe incompatibility. The proposed block rotation strategy leads to substantial accuracy improvements across diverse LLMs when adapting rotation-based methods to MXFP4.

Conclusion: The findings provide clear guidance for practitioners and establish a foundation for advancing PTQ research under emerging low-precision formats like MXFP4, with the proposed block rotation effectively resolving the incompatibility issue.

Abstract: Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4 – a new FP4 format with broad hardware support (NVIDIA, AMD, Intel) – raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which almost all state-of-the-art methods employ, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4’s PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also set a foundation for advancing PTQ research under emerging low-precision formats.
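To make the PoT-block-scaling mechanics concrete, here is a minimal numpy sketch of MXFP4-style fake quantization with a per-block rotation. The E2M1 magnitude grid, 32-element blocks, and shared power-of-two scales follow the MXFP4 format; the Hadamard rotation and all function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

FP4_GRID = np.array([0, .5, 1, 1.5, 2, 3, 4, 6])  # E2M1 magnitude levels

def hadamard(n):
    """Orthonormal Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def mxfp4_fake_quant(x, block=32):
    """Quantize-dequantize with a shared power-of-two scale per block."""
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    # PoT block scaling: scale is the smallest power of two with amax/scale <= 6.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / FP4_GRID[-1]))
    q = xb / scale
    # Round each element to the nearest FP4 magnitude, keeping the sign.
    idx = np.abs(q[..., None] - np.sign(q)[..., None] * FP4_GRID).argmin(-1)
    return (np.sign(q) * FP4_GRID[idx] * scale).reshape(-1)

def block_rotated_quant(x, block=32):
    """Rotate within each block, quantize, rotate back: the rotation stays
    aligned with the PoT block-scaling granularity."""
    H = hadamard(block)
    xb = x.reshape(-1, block)
    y = mxfp4_fake_quant((xb @ H).reshape(-1), block).reshape(-1, block)
    return (y @ H.T).reshape(-1)
```

A global rotation would instead mix values across block boundaries, which the paper identifies as the source of the incompatibility.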

[287] Learning to Land Anywhere: Transferable Generative Models for Aircraft Trajectories

Olav Finne Praesteng Larsen, Massimiliano Ruocco, Michail Spitieris, Abdulmajid Murad, Martina Ragosta

Main category: cs.LG

TL;DR: Transfer learning enables generative models trained on data-rich airports to be adapted to data-scarce airports, achieving competitive performance with only 5-20% of target data.

Motivation: Many secondary and regional airports face severe data scarcity, limiting the applicability of machine learning methods and large-scale simulations in Air Traffic Management.

Method: Adapt diffusion- and flow-matching-based architectures to aviation domain, pretrain on Zurich data, then fine-tune on Dublin data with varying amounts (0-100%) of local data.

Result: Diffusion-based models achieve competitive performance with 5% Dublin data and reach baseline performance at 20%, consistently outperforming models trained from scratch across metrics.

Conclusion: Transfer learning can substantially reduce data requirements for trajectory generation in ATM, enabling realistic synthetic data generation even in environments with limited historical records.

Abstract: Access to trajectory data is a key requirement for developing and validating Air Traffic Management (ATM) solutions, yet many secondary and regional airports face severe data scarcity. This limits the applicability of machine learning methods and the ability to perform large-scale simulations or “what-if” analyses. In this paper, we investigate whether generative models trained on data-rich airports can be efficiently adapted to data-scarce airports using transfer learning. We adapt state-of-the-art diffusion- and flow-matching-based architectures to the aviation domain and evaluate their transferability between Zurich (source) and Dublin (target) landing trajectory datasets. Models are pretrained on Zurich and fine-tuned on Dublin with varying amounts of local data, ranging from 0% to 100%. Results show that diffusion-based models achieve competitive performance with as little as 5% of the Dublin data and reach baseline-level performance around 20%, consistently outperforming models trained from scratch across metrics and visual inspections. Latent flow matching and latent diffusion models also benefit from pretraining, though with more variable gains, while flow matching models show weaker generalization. Despite challenges in capturing rare trajectory patterns, these findings demonstrate the potential of transfer learning to substantially reduce data requirements for trajectory generation in ATM, enabling realistic synthetic data generation even in environments with limited historical records.

[288] Deep Learning Approach for Clinical Risk Identification Using Transformer Modeling of Heterogeneous EHR Data

Anzhuo Xie, Wei-Chen Chang

Main category: cs.LG

TL;DR: Transformer-based method for clinical risk classification using heterogeneous EHR data that handles irregular temporal patterns, modality differences, and complex semantics through unified embedding, temporal encoding, and semantic-weighted pooling.

Motivation: Address challenges in clinical risk classification with heterogeneous EHR data including irregular temporal patterns, large modality differences, and complex semantic structures.

Method: Uses feature embedding for unified representation, learnable temporal encoding for dynamic evolution, multi-head self-attention for global dependency modeling, semantic-weighted pooling for adaptive importance assignment, and linear mapping for risk scores.

Result: Outperforms traditional machine learning and temporal deep learning models in accuracy, recall, precision, and F1-Score, achieving stable and precise risk identification.

Conclusion: Provides an efficient and reliable framework for clinical intelligent decision-making in multi-source heterogeneous EHR environments.

Abstract: This study proposes a Transformer-based longitudinal modeling method to address challenges in clinical risk classification with heterogeneous Electronic Health Record (EHR) data, including irregular temporal patterns, large modality differences, and complex semantic structures. The method takes multi-source medical features as input and employs a feature embedding layer to achieve a unified representation of structured and unstructured data. A learnable temporal encoding mechanism is introduced to capture dynamic evolution under uneven sampling intervals. The core model adopts a multi-head self-attention structure to perform global dependency modeling on longitudinal sequences, enabling the aggregation of long-term trends and short-term fluctuations across different temporal scales. To enhance semantic representation, a semantic-weighted pooling module is designed to assign adaptive importance to key medical events, improving the discriminative ability of risk-related features. Finally, a linear mapping layer generates individual-level risk scores. Experimental results show that the proposed model outperforms traditional machine learning and temporal deep learning models in accuracy, recall, precision, and F1-Score, achieving stable and precise risk identification in multi-source heterogeneous EHR environments and providing an efficient and reliable framework for clinical intelligent decision-making.
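Of the components above, the semantic-weighted pooling step is the easiest to sketch: a softmax over per-event importance scores yields an adaptive weighted summary of the sequence (a minimal numpy illustration with a hypothetical scoring vector; in the paper the weights are learned jointly with the transformer):

```python
import numpy as np

def semantic_weighted_pool(h, w):
    """h: (T, d) sequence of event embeddings; w: (d,) scoring vector.

    Returns a (d,) summary in which medical events with higher scores
    contribute more, rather than a plain mean over time steps.
    """
    scores = h @ w                        # per-event importance logits
    alpha = np.exp(scores - scores.max()) # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ h                      # adaptive weighted summary
```

With a zero scoring vector the softmax is uniform and the module reduces to mean pooling, which makes the role of the learned weights easy to see.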

[289] The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity

Tim Tomov, Dominik Fuchsgruber, Tom Wollschläger, Stephan Günnemann

Main category: cs.LG

TL;DR: Current uncertainty quantification methods for LLMs perform poorly on ambiguous data, degrading to near-random performance despite working well on unambiguous tasks.

Motivation: Real-world language is inherently ambiguous, but existing UQ methods are typically benchmarked only on unambiguous tasks, creating a gap between research and practical deployment needs.

Method: Introduced MAQA* and AmbigQA* datasets with ground-truth answer distributions from factual co-occurrence, and tested various UQ estimators including predictive distribution, internal representations, and model ensembles.

Result: All tested uncertainty estimators performed close to random on ambiguous data, with theoretical analysis showing fundamental limitations of predictive-distribution and ensemble-based methods under ambiguity.

Conclusion: Current UQ methods for LLMs have a critical shortcoming in handling ambiguity, requiring a rethinking of modeling paradigms for trustworthy deployment.

Abstract: Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.

[290] On Joint Regularization and Calibration in Deep Ensembles

Laurits Fredsgaard, Mikkel N. Schmidt

Main category: cs.LG

TL;DR: Jointly tuning deep ensembles improves performance and uncertainty calibration compared to individual tuning, with a proposed overlapping holdout strategy offering practical benefits.

Motivation: Deep ensembles typically train models individually, but evidence suggests joint tuning can improve performance and uncertainty quantification.

Method: Proposed jointly tuning weight decay, temperature scaling, and early stopping for ensembles, with a partially overlapping holdout strategy to balance joint evaluation and data usage.

Result: Joint tuning generally matches or improves performance with varying effect sizes across tasks, showing trade-offs between individual and joint optimization.

Conclusion: The overlapping holdout strategy provides a practical solution for optimizing deep ensembles, offering valuable guidance for practitioners.

Abstract: Deep ensembles are a powerful tool in machine learning, improving both model performance and uncertainty calibration. While ensembles are typically formed by training and tuning models individually, evidence suggests that jointly tuning the ensemble can lead to better performance. This paper investigates the impact of jointly tuning weight decay, temperature scaling, and early stopping on both predictive performance and uncertainty quantification. Additionally, we propose a partially overlapping holdout strategy as a practical compromise between enabling joint evaluation and maximizing the use of data for training. Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We highlight the trade-offs between individual and joint optimization in deep ensemble training, with the overlapping holdout strategy offering an attractive practical solution. We believe our findings provide valuable insights and guidance for practitioners looking to optimize deep ensemble models. Code is available at: https://github.com/lauritsf/ensemble-optimality-gap
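One plausible construction of a partially overlapping holdout is a small validation chunk shared by all ensemble members (enabling joint evaluation) plus a member-specific disjoint slice (preserving data for training). This split is an assumption for illustration only; the paper's exact scheme may differ:

```python
import numpy as np

def overlapping_holdouts(n, n_members, shared_frac=0.1, own_frac=0.1, seed=0):
    """Return one holdout index set per member: a common shared chunk plus a
    member-specific slice. Any two members' holdouts overlap exactly in the
    shared chunk, which can be used for joint tuning and calibration."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_shared = int(n * shared_frac)
    shared, rest = idx[:n_shared], idx[n_shared:]
    n_own = int(n * own_frac)
    return [np.concatenate([shared, rest[m * n_own:(m + 1) * n_own]])
            for m in range(n_members)]
```

Each member still trains on most of the data, while the shared chunk gives a common yardstick for jointly tuning weight decay, temperature scaling, and early stopping.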

[291] Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs

Alberto Cattaneo, Carlo Luschi, Daniel Justus

Main category: cs.LG

TL;DR: SynthKGQA is a framework for generating synthetic Knowledge Graph Question Answering datasets with full ground-truth facts, enabling better benchmarking and training of KG retrievers.

Motivation: There's a lack of challenging QA datasets with ground-truth targets for graph retrieval, making comparison of methods difficult.

Method: Developed SynthKGQA framework to generate synthetic KGQA datasets from any Knowledge Graph, providing complete ground-truth facts for reasoning.

Result: Created GTSQA dataset from Wikidata to test zero-shot generalization of KG retrievers on unseen graph structures and relation types.

Conclusion: SynthKGQA enables more informative benchmarking of KG retrievers and allows training better models through high-quality synthetic data.

Abstract: Retrieval of information from graph-structured knowledge bases represents a promising direction for improving the factuality of LLMs. While various solutions have been proposed, a comparison of methods is difficult due to the lack of challenging QA datasets with ground-truth targets for graph retrieval. We present SynthKGQA, a framework for generating high-quality synthetic Knowledge Graph Question Answering datasets from any Knowledge Graph, providing the full set of ground-truth facts in the KG to reason over each question. We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset designed to test zero-shot generalization abilities of KG retrievers with respect to unseen graph structures and relation types, and benchmark popular solutions for KG-augmented LLMs on it.

[292] ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads

Xiaokai Wang, Shaoyuan Huang, Yuting Li, Xiaofei Wang

Main category: cs.LG

TL;DR: ScaleDL is a runtime prediction framework for DNN workloads that combines nonlinear layer-wise modeling with GNN-based cross-layer interactions, achieving 6x lower MRE and 5x lower RMSE compared to baselines while reducing data collection costs.

Motivation: As DNN models grow in size and complexity, accurate runtime prediction becomes crucial for optimizing development and resource allocation. Traditional additive models lack accuracy and generalizability, while graph-enhanced methods incur high data collection costs.

Method: Proposes ScaleDL framework with nonlinear layer-wise modeling and GNN-based cross-layer interaction mechanism, using D-optimal method to reduce data collection costs.

Result: Experiments on five popular DNN models show ScaleDL achieves 6x lower Mean Relative Error (MRE) and 5x lower Root Mean Square Error (RMSE) compared to baseline models.

Conclusion: ScaleDL successfully balances accuracy, generalizability, and data collection costs, providing an effective solution for DNN runtime prediction across different network architectures.

Abstract: Deep neural networks (DNNs) form the cornerstone of modern AI services, supporting a wide range of applications, including autonomous driving, chatbots, and recommendation systems. As models increase in size and complexity, DNN workloads like training and inference tasks impose unprecedented demands on distributed computing resources, making the accurate prediction of runtime essential for optimizing development and resource allocation. Traditional methods rely on additive computational unit models, limiting their accuracy and generalizability. In contrast, graph-enhanced modeling improves performance but significantly increases data collection costs. Therefore, there is a critical need for a method that strikes a balance between accuracy, generalizability, and the costs of data collection. To address these challenges, we propose ScaleDL, a novel runtime prediction framework that combines nonlinear layer-wise modeling with graph neural network (GNN)-based cross-layer interaction mechanism, enabling accurate DNN runtime prediction and hierarchical generalizability across different network architectures. Additionally, we employ the D-optimal method to reduce data collection costs. Experiments on the workloads of five popular DNN models prove that ScaleDL enhances runtime prediction accuracy and generalizability, achieving 6$\times$ lower MRE and 5$\times$ lower RMSE compared to baseline models.

[293] The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura

Main category: cs.LG

TL;DR: The paper proves the strong lottery ticket hypothesis for transformer architectures, specifically for multi-head attention mechanisms, showing that randomly initialized transformers contain subnetworks that can approximate arbitrary target transformers.

Motivation: While the strong lottery ticket hypothesis has been established for various neural architectures, it lacked theoretical understanding for transformer architectures, particularly for the multi-head attention mechanism, which is a core component of transformers.

Method: Theoretical analysis proving that randomly initialized multi-head attention with sufficient hidden dimension contains strong lottery tickets that approximate arbitrary target MHAs. Extended this to transformers without normalization layers.

Result: Proved that with hidden dimension O(d log(Hd^{3/2})) for key and value, a randomly initialized MHA contains an SLT that approximates an arbitrary MHA with high probability. Empirical validation shows approximation error decreases exponentially with hidden dimension.

Conclusion: The strong lottery ticket hypothesis holds for transformer architectures, including multi-head attention mechanisms, providing theoretical foundation for pruning-based approaches in transformers.

Abstract: The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA of $H$ heads and input dimension $d$ has the hidden dimension $O(d\log(Hd^{3/2}))$ for the key and value, it contains an SLT that approximates an arbitrary MHA with the same input dimension with high probability. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA and transformer) and an approximate target counterpart decreases exponentially by increasing the hidden dimension of the source model.

[294] seqme: a Python library for evaluating biological sequence design

Rasmus Møller-Larsen, Adam Izdebski, Jan Olszewski, Pankhil Gawade, Michal Kmicikiewicz, Wojciech Zarzecki, Ewa Szczurek

Main category: cs.LG

TL;DR: seqme is a modular Python library for evaluating computational methods in biological sequence design, offering model-agnostic metrics across sequence-based, embedding-based, and property-based categories for various biological sequences.

Motivation: There was a lack of a single software library implementing metrics to evaluate computational methods for designing biological sequences, despite recent advances in the field.

Method: Developed seqme as a modular and extendable open-source Python library with three groups of metrics: sequence-based, embedding-based, and property-based. The library includes embedding and property models, diagnostics, and visualization functions.

Result: Created a comprehensive library applicable to various biological sequences (small molecules, DNA, ncRNA, mRNA, peptides, proteins) that can evaluate both one-shot and iterative computational design methods.

Conclusion: seqme fills an important gap by providing a unified toolkit for evaluating biological sequence design methods, enabling standardized assessment across different computational approaches and sequence types.

Abstract: Recent advances in computational methods for designing biological sequences have sparked the development of metrics to evaluate these methods' performance in terms of the fidelity of the designed sequences to a target distribution and their attainment of desired properties. However, a single software library implementing these metrics was lacking. In this work we introduce seqme, a modular and highly extendable open-source Python library, containing model-agnostic metrics for evaluating computational methods for biological sequence design. seqme considers three groups of metrics: sequence-based, embedding-based, and property-based, and is applicable to a wide range of biological sequences: small molecules, DNA, ncRNA, mRNA, peptides and proteins. The library offers a number of embedding and property models for biological sequences, as well as diagnostics and visualization functions to inspect the results. seqme can be used to evaluate both one-shot and iterative computational design methods.

[295] Guided by Stars: Interpretable Concept Learning Over Time Series via Temporal Logic Semantics

Irene Ferfoglia, Simone Silvetti, Gaia Saveri, Laura Nenzi, Luca Bortolussi

Main category: cs.LG

TL;DR: STELLE is a neuro-symbolic framework for time series classification that embeds trajectories into temporal logic concepts, providing both accurate predictions and human-readable logical explanations.

Motivation: Time series classification frequently arises in safety-critical applications, but current deep learning methods are black boxes, making their reasoning hard to understand. Interpretable models that provide human-understandable explanations are needed.

Method: Uses Signal Temporal Logic (STL) embedding with a novel STL-inspired kernel that maps time series to their alignment with predefined STL formulae. Jointly optimizes classification accuracy and interpretability by providing logical concepts for predictions.

Result: STELLE achieves competitive accuracy on diverse real-world benchmarks while providing logically faithful explanations, including local explanations (STL conditions for individual predictions) and global explanations (class-characterizing formulae).

Conclusion: The framework successfully unifies classification and explanation through temporal logic embedding, offering both accurate predictions and interpretable, human-readable logical justifications for time series classification tasks.

Abstract: Time series classification is a task of paramount importance, as this kind of data often arises in safety-critical applications. However, it is typically tackled with black-box deep learning methods, making it hard for humans to understand the rationale behind their output. To take on this challenge, we propose a novel approach, STELLE (Signal Temporal logic Embedding for Logically-grounded Learning and Explanation), a neuro-symbolic framework that unifies classification and explanation through direct embedding of trajectories into a space of temporal logic concepts. By introducing a novel STL-inspired kernel that maps raw time series to their alignment with predefined STL formulae, our model jointly optimises accuracy and interpretability, as each prediction is accompanied by the most relevant logical concepts that characterise it. This yields (i) local explanations as human-readable STL conditions justifying individual predictions, and (ii) global explanations as class-characterising formulae. Experiments demonstrate that STELLE achieves competitive accuracy while providing logically faithful explanations, validated on diverse real-world benchmarks.
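The STL-alignment idea can be illustrated with the standard quantitative (robustness) semantics of two simple formulae: mapping a trace to its robustness against each formula in a bank yields an interpretable feature vector (illustrative helpers only; the paper's kernel and formula bank are defined differently):

```python
import numpy as np

def rob_always_below(x, c):
    """Robustness of G(x < c): worst-case margin over the trace."""
    return float(np.min(c - x))

def rob_eventually_above(x, c):
    """Robustness of F(x > c): best-case margin over the trace."""
    return float(np.max(x - c))

def stl_features(x, formulae):
    """Map a time series to its alignment with each STL formula."""
    return np.array([f(x) for f in formulae])

x = np.array([0.1, 0.4, 0.3])
feats = stl_features(x, [lambda s: rob_always_below(s, 0.5),
                         lambda s: rob_eventually_above(s, 0.2)])
```

A positive feature means the trace satisfies the formula with that margin, which is what makes each coordinate of the representation human-readable.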

[296] Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference

Matteo Cercola, Valeria Capretti, Simone Formentin

Main category: cs.LG

TL;DR: A hybrid framework combining RLHF’s scalability with PBO’s sample efficiency through active querying in preference learning.

Motivation: Human preference data collection is costly and time-consuming, requiring more efficient learning paradigms that leverage the complementary strengths of existing approaches.

Method: Integration of an acquisition-driven module into the RLHF pipeline to enable active and sample-efficient preference gathering, unifying RLHF’s scalability with PBO’s query efficiency.

Result: Experimental validation on high-dimensional preference optimization and LLM fine-tuning shows consistent improvements in both sample efficiency and overall performance.

Conclusion: The proposed hybrid framework successfully combines the advantages of RLHF and PBO, achieving better sample efficiency while maintaining scalability for preference learning tasks.

Abstract: Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet, collecting such preference data is often costly and time-consuming, motivating the need for more efficient learning paradigms. Two established approaches offer complementary advantages: RLHF scales effectively to high-dimensional tasks such as LLM fine-tuning, while PBO achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF’s scalability with PBO’s query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
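An acquisition-driven preference query can be sketched as selecting the candidate pair whose predicted preference is most uncertain under a Bradley-Terry reward model (a hypothetical acquisition rule chosen for illustration; the paper's module may use a different criterion):

```python
import numpy as np

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that item a is preferred over item b."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def most_uncertain_pair(rewards, pairs):
    """Active querying: ask the human about the pair with maximal
    predictive entropy, i.e. preference probability closest to 0.5."""
    def entropy(p):
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return max(pairs, key=lambda ab:
               entropy(preference_prob(rewards[ab[0]], rewards[ab[1]])))

rewards = np.array([0.0, 0.1, 2.0])
pairs = [(0, 1), (0, 2), (1, 2)]
# (0, 1) has nearly equal rewards, hence maximal preference entropy.
```

Pairs the model already separates confidently are skipped, which is where the sample-efficiency gain over passively collected preferences comes from.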

[297] Differentially Private In-Context Learning with Nearest Neighbor Search

Antti Koskela, Tejas Kulkarni, Laith Zumot

Main category: cs.LG

TL;DR: A differentially private framework for in-context learning that integrates nearest neighbor search with privacy-aware sample selection, outperforming existing methods across multiple benchmarks.

Motivation: Existing DP-ICL approaches overlook the similarity search component in modern LLM pipelines, creating privacy risks in context data retrieval.

Method: Uses nearest neighbor retrieval from context database combined with a privacy filter that tracks cumulative privacy cost of selected samples to maintain central differential privacy budget.

Result: Outperforms existing baselines by substantial margin across all evaluated benchmarks, achieving more favorable privacy-utility trade-offs in text classification and document QA tasks.

Conclusion: The proposed DP framework with privacy-aware nearest neighbor search provides clear advantages over existing methods for differentially private in-context learning.

Abstract: Differentially private in-context learning (DP-ICL) has recently become an active research topic due to the inherent privacy risks of in-context learning. However, existing approaches overlook a critical component of modern large language model (LLM) pipelines: the similarity search used to retrieve relevant context data. In this work, we introduce a DP framework for in-context learning that integrates nearest neighbor search of relevant examples in a privacy-aware manner. Our method outperforms existing baselines by a substantial margin across all evaluated benchmarks, achieving more favorable privacy-utility trade-offs. To achieve this, we employ nearest neighbor retrieval from a database of context data, combined with a privacy filter that tracks the cumulative privacy cost of selected samples to ensure adherence to a central differential privacy budget. Experimental results on text classification and document question answering show a clear advantage of the proposed method over existing baselines.
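The privacy filter described above reduces to a running tally of per-retrieval privacy cost checked against the central budget. A minimal sketch under basic composition (the paper's accountant may be tighter, and the class name is an assumption):

```python
class PrivacyFilter:
    """Track cumulative privacy cost of selected context samples and refuse
    any retrieval step that would exceed the central epsilon budget.
    Uses basic composition (costs add up) for simplicity."""

    def __init__(self, epsilon_budget):
        self.budget = epsilon_budget
        self.spent = 0.0

    def try_spend(self, eps):
        """Admit one retrieval step of cost eps, or reject it."""
        if self.spent + eps > self.budget:
            return False
        self.spent += eps
        return True

f = PrivacyFilter(1.0)
assert f.try_spend(0.4) and f.try_spend(0.4)
assert not f.try_spend(0.4)  # would exceed the budget, so it is refused
```

Gating the nearest-neighbor retrieval through such a filter is what lets the pipeline keep selecting relevant examples while staying within a fixed central DP budget.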

[298] LUME-DBN: Full Bayesian Learning of DBNs from Incomplete data in Intensive Care

Federico Pirola, Fabio Stella, Marco Grzegorczyk

Main category: cs.LG

TL;DR: A novel Gibbs sampling method for learning Dynamic Bayesian Networks from incomplete clinical data, treating missing values as parameters to enable principled imputation and uncertainty estimation.

DetailsMotivation: Existing missing data methods for longitudinal clinical data are derived from static models and fail to properly account for temporal dynamics, limiting uncertainty quantification over time, which is critical for clinical decision-making in settings such as intensive care.

Method: Gibbs sampling-based approach that treats each missing value as an unknown parameter following a Gaussian distribution, sampling unobserved values from their full conditional distributions at each iteration.

Result: Superior reconstruction accuracy and convergence properties compared to standard model-agnostic techniques like MICE, with better performance on both simulated datasets and real-world intensive care data.

Conclusion: The Bayesian approach provides more reliable imputations and deeper insight into model behavior, supporting safer clinical decision-making in settings with frequent and impactful missing data.

Abstract: Dynamic Bayesian networks (DBNs) are increasingly used in healthcare due to their ability to model complex temporal relationships in patient data while maintaining interpretability, an essential feature for clinical decision-making. However, existing approaches to handling missing data in longitudinal clinical datasets are largely derived from static Bayesian networks literature, failing to properly account for the temporal nature of the data. This gap limits the ability to quantify uncertainty over time, which is particularly critical in settings such as intensive care, where understanding the temporal dynamics is fundamental for model trustworthiness and applicability across diverse patient groups. Despite the potential of DBNs, a full Bayesian framework that integrates missing data handling remains underdeveloped. In this work, we propose a novel Gibbs sampling-based method for learning DBNs from incomplete data. Our method treats each missing value as an unknown parameter following a Gaussian distribution. At each iteration, the unobserved values are sampled from their full conditional distributions, allowing for principled imputation and uncertainty estimation. We evaluate our method on both simulated datasets and real-world intensive care data from critically ill patients. Compared to standard model-agnostic techniques such as MICE, our Bayesian approach demonstrates superior reconstruction accuracy and convergence properties. These results highlight the clinical relevance of incorporating full Bayesian inference in temporal models, providing more reliable imputations and offering deeper insight into model behavior. Our approach supports safer and more informed clinical decision-making, particularly in settings where missing data are frequent and potentially impactful.
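
The core Gibbs step, sampling each missing value from its full conditional Gaussian, can be illustrated on a toy AR(1) chain. This is a deliberately simplified stand-in for a DBN; the model and all names below are assumptions for the sketch.

```python
import numpy as np

def gibbs_impute_ar1(x, missing, a, sigma, n_iter=200, rng=None):
    """Gibbs-style imputation for an AR(1) chain x_t = a*x_{t-1} + N(0, sigma^2).

    Each missing x_t is treated as an unknown parameter and resampled from its
    full conditional given its two neighbours (Gaussian, by conjugacy).
    Returns posterior draws of the missing entries.
    """
    rng = rng or np.random.default_rng(0)
    x = x.copy()
    draws = []
    for _ in range(n_iter):
        for t in missing:
            # Product of N(x_t; a*x[t-1], sigma^2) and the likelihood of
            # x[t+1] given x[t]; both Gaussian, so the posterior is Gaussian.
            prec = (1.0 + a * a) / sigma**2
            mean = (a * x[t - 1] + a * x[t + 1]) / (1.0 + a * a)
            x[t] = rng.normal(mean, 1.0 / np.sqrt(prec))
        draws.append([x[t] for t in missing])
    return np.array(draws)
```

The spread of the draws is exactly the uncertainty estimate the paper argues static imputation methods fail to provide over time.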

[299] Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness

Subeen Park, Joowang Kim, Hakyung Lee, Sunjae Yoo, Kyungwoo Song

Main category: cs.LG

TL;DR: SCER is a novel embedding regularization method that suppresses spurious correlations to improve worst-group robustness in deep learning models, outperforming state-of-the-art approaches.

DetailsMotivation: Deep learning models often rely on spurious correlations, making them vulnerable to distribution shifts, especially in subpopulation scenarios where they struggle with underrepresented groups. Existing methods lack a theoretical framework connecting embedding space representations with worst-group error.

Method: Proposes Spurious Correlation-Aware Embedding Regularization (SCER) that directly regularizes feature representations to suppress spurious cues. Identifies spurious and core directions from group-wise mean embedding differences across domains and classes, then imposes theoretical constraints at the embedding level.

Result: SCER outperforms prior state-of-the-art methods in worst-group accuracy across multiple vision and language benchmarks.

Conclusion: By theoretically constraining embedding representations to focus on core features while reducing sensitivity to spurious patterns, SCER effectively improves model robustness in worst-group scenarios.

Abstract: Deep learning models achieve strong performance across various domains but often rely on spurious correlations, making them vulnerable to distribution shifts. This issue is particularly severe in subpopulation shift scenarios, where models struggle in underrepresented groups. While existing methods have made progress in mitigating this issue, their performance gains are still constrained. They lack a rigorous theoretical framework connecting the embedding space representations with worst-group error. To address this limitation, we propose Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness (SCER), a novel approach that directly regularizes feature representations to suppress spurious cues. We show theoretically that worst-group error is influenced by how strongly the classifier relies on spurious versus core directions, identified from differences in group-wise mean embeddings across domains and classes. By imposing theoretical constraints at the embedding level, SCER encourages models to focus on core features while reducing sensitivity to spurious patterns. Through systematic evaluation on multiple vision and language benchmarks, we show that SCER outperforms prior state-of-the-art studies in worst-group accuracy. Our code is available at \href{https://github.com/MLAI-Yonsei/SCER}{https://github.com/MLAI-Yonsei/SCER}.
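
A minimal sketch of the two ingredients above, estimating a spurious direction from group-wise mean embedding differences and penalizing projections onto it, might look as follows. It is simplified to two domains and is not the authors' implementation.

```python
import numpy as np

def spurious_direction(emb, domain, label):
    """Estimate a spurious direction as the difference of group-wise mean
    embeddings across domains, averaged over classes (simplified; assumes
    exactly two domains for brevity)."""
    dirs = []
    for c in np.unique(label):
        mask = label == c
        mu = [emb[mask & (domain == d)].mean(axis=0) for d in np.unique(domain)]
        dirs.append(mu[0] - mu[1])
    v = np.mean(dirs, axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def scer_penalty(emb, v):
    """Embedding-level regularizer: mean squared projection onto the
    spurious direction, discouraging reliance on the spurious cue."""
    return float(np.mean((emb @ v) ** 2))
```

Adding `scer_penalty` to a training loss would push representations toward the core subspace; here we only verify the geometry.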

[300] On the Equivalence of Regression and Classification

Jayadeva, Naman Dwivedi, Hari Krishnan, N. M. Anoop Krishnan

Main category: cs.LG

TL;DR: Establishes formal equivalence between regression and classification, showing regression with M samples on a hyperplane equals linearly separable classification with 2M samples, leading to new regression formulation and regressability measure.

DetailsMotivation: To create a formal link between regression and classification, which has been tenuous, and justify margin maximization in regression beyond just regularization.

Method: Prove equivalence between regression on hyperplane and linearly separable classification, use margin maximization on equivalent classification task to derive new regression formulation.

Result: Developed regressability measure to estimate dataset difficulty without model training, and used equivalence to train neural networks for linearizing maps.

Conclusion: Established formal equivalence between regression and classification, enabling new regression formulations and practical tools for assessing regression difficulty.

Abstract: A formal link between regression and classification has been tenuous. Even though the margin maximization term $\|w\|$ is used in support vector regression, it has at best been justified as a regularizer. We show that a regression problem with $M$ samples lying on a hyperplane has a one-to-one equivalence with a linearly separable classification task with $2M$ samples. We show that margin maximization on the equivalent classification task leads to a different regression formulation than traditionally used. Using the equivalence, we demonstrate a "regressability" measure that can be used to estimate the difficulty of regressing a dataset, without needing to first learn a model for it. We use the equivalence to train neural networks to learn a linearizing map that transforms input variables into a space where a linear regressor is adequate.
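
The M-to-2M construction can be made concrete: lift each sample into (x, y)-space and shift the target up and down by a margin eps (an illustrative parameter, not from the paper). If the data lie exactly on a hyperplane, the lifted set is linearly separable by that same hyperplane.

```python
import numpy as np

def regression_to_classification(X, y, eps=0.1):
    """Lift a regression set of M samples to a 2M-sample classification set
    in (x, y)-space: shift each target up/down by eps and label the copies
    +1 / -1. For data on a hyperplane y = w.x + b, the decision function
    f(z) = z[-1] - w.x - b evaluates to +eps / -eps, so the lifted set is
    linearly separable (a sketch of the equivalence, not the paper's proof)."""
    Z_pos = np.hstack([X, (y + eps)[:, None]])
    Z_neg = np.hstack([X, (y - eps)[:, None]])
    Z = np.vstack([Z_pos, Z_neg])
    labels = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    return Z, labels
```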

[301] ForecastGAN: A Decomposition-Based Adversarial Framework for Multi-Horizon Time Series Forecasting

Syeda Sitara Wishal Fatima, Afshin Rahimi

Main category: cs.LG

TL;DR: ForecastGAN is a novel decomposition-based adversarial framework for multi-horizon time series forecasting that outperforms transformer models in short-term scenarios while remaining competitive for long-term predictions.

DetailsMotivation: Transformer models excel in long-term forecasting but underperform in short-term scenarios and typically ignore categorical features, creating a need for a more balanced and comprehensive approach.

Method: Three integrated modules: Decomposition Module (extracts seasonality and trend), Model Selection Module (identifies optimal neural network configurations), and Adversarial Training Module (enhances robustness through Conditional GAN training).

Result: Outperforms state-of-the-art transformer models on 11 benchmark datasets for short-term forecasting while remaining competitive for long-term horizons.

Conclusion: Establishes a more generalizable approach to time series forecasting that adapts to specific contexts while maintaining strong performance across diverse data characteristics without extensive hyperparameter tuning.

Abstract: Time series forecasting is essential across domains from finance to supply chain management. This paper introduces ForecastGAN, a novel decomposition-based adversarial framework addressing limitations in existing approaches for multi-horizon predictions. Although transformer models excel in long-term forecasting, they often underperform in short-term scenarios and typically ignore categorical features. ForecastGAN operates through three integrated modules: a Decomposition Module that extracts seasonality and trend components; a Model Selection Module that identifies optimal neural network configurations based on forecasting horizon; and an Adversarial Training Module that enhances prediction robustness through Conditional Generative Adversarial Network training. Unlike conventional approaches, ForecastGAN effectively integrates both numerical and categorical features. We validate our framework on eleven benchmark multivariate time series datasets that span various forecasting horizons. The results show that ForecastGAN consistently outperforms state-of-the-art transformer models for short-term forecasting while remaining competitive for long-term horizons. This research establishes a more generalizable approach to time series forecasting that adapts to specific contexts while maintaining strong performance across diverse data characteristics without extensive hyperparameter tuning.
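
A bare-bones version of the Decomposition Module's seasonality/trend split, using a moving-average trend and periodic means, can be sketched as below. This is a common classical scheme assumed here for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def decompose(series, period):
    """Split a series into trend + seasonality + residual: a centred
    moving average estimates the trend, and per-phase means of the
    detrended series estimate the seasonal profile."""
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")  # moving-average trend
    detrended = series - trend
    seasonal_profile = np.array(
        [detrended[i::period].mean() for i in range(period)])
    seasonal = np.resize(seasonal_profile, len(series))  # tile over the series
    residual = series - trend - seasonal
    return trend, seasonal, residual
```

Each component can then be modelled separately, which is the premise of the module structure described above.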

[302] Federated Stochastic Minimax Optimization under Heavy-Tailed Noises

Xinwen Zhang, Hongchang Gao

Main category: cs.LG

TL;DR: Proposes two federated learning algorithms (Fed-NSGDA-M and FedMuon-DA) for nonconvex-PL minimax optimization under heavy-tailed gradient noise, achieving O(1/(TNp)^((s-1)/2s)) convergence rate.

DetailsMotivation: Heavy-tailed noise is more realistic than standard bounded variance assumptions in real-world applications, and federated minimax optimization under such noise conditions lacks theoretical guarantees.

Method: Fed-NSGDA-M uses normalized gradients, while FedMuon-DA employs the Muon optimizer for local updates. Both are designed to handle heavy-tailed noise with milder conditions.

Result: Both algorithms achieve convergence rate of O(1/(TNp)^((s-1)/2s)) under heavy-tailed noise, providing the first theoretical guarantees for federated minimax optimization in such settings.

Conclusion: The proposed algorithms effectively address heavy-tailed noise in federated minimax optimization with rigorous theoretical foundations and empirical validation.

Abstract: Heavy-tailed noise has attracted growing attention in nonconvex stochastic optimization, as numerous empirical studies suggest it offers a more realistic assumption than standard bounded variance assumption. In this work, we investigate nonconvex-PL minimax optimization under heavy-tailed gradient noise in federated learning. We propose two novel algorithms: Fed-NSGDA-M, which integrates normalized gradients, and FedMuon-DA, which leverages the Muon optimizer for local updates. Both algorithms are designed to effectively address heavy-tailed noise in federated minimax optimization, under a milder condition. We theoretically establish that both algorithms achieve a convergence rate of $O({1}/{(TNp)^{\frac{s-1}{2s}}})$. To the best of our knowledge, these are the first federated minimax optimization algorithms with rigorous theoretical guarantees under heavy-tailed noise. Extensive experiments further validate their effectiveness.
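
The normalization idea behind Fed-NSGDA-M can be shown in a single-worker sketch: the update direction is the momentum buffer scaled to unit norm, so a single heavy-tailed gradient cannot produce an unbounded step. This omits the federated averaging and the minimax structure entirely and is not the paper's algorithm.

```python
import numpy as np

def normalized_sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=100):
    """Normalized gradient descent with momentum: every step has length
    exactly `lr`, bounding the influence of heavy-tailed gradient noise."""
    w = w0.astype(float).copy()
    m = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        m = beta * m + (1.0 - beta) * g
        w -= lr * m / (np.linalg.norm(m) + 1e-12)  # normalization step
    return w
```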

[303] Towards Causal Market Simulators

Dennis Thumm, Luis Ontaneda Mijares

Main category: cs.LG

TL;DR: TNCM-VAE combines VAE with structural causal models to generate counterfactual financial time series that preserve temporal dependencies and causal relationships, outperforming existing methods in counterfactual probability estimation.

DetailsMotivation: Existing market generators lack causal reasoning capabilities needed for counterfactual analysis and risk assessment in financial applications.

Method: Uses variational autoencoders with structural causal models, enforcing causal constraints through DAGs in decoder architecture and employing causal Wasserstein distance for training.

Result: Achieved superior performance with L1 distances of 0.03-0.10 compared to ground truth on synthetic autoregressive models inspired by Ornstein-Uhlenbeck process.

Conclusion: Enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.

Abstract: Market generators using deep generative models have shown promise for synthetic financial data generation, but existing approaches lack causal reasoning capabilities essential for counterfactual analysis and risk assessment. We propose a Time-series Neural Causal Model VAE (TNCM-VAE) that combines variational autoencoders with structural causal models to generate counterfactual financial time series while preserving both temporal dependencies and causal relationships. Our approach enforces causal constraints through directed acyclic graphs in the decoder architecture and employs the causal Wasserstein distance for training. We validate our method on synthetic autoregressive models inspired by the Ornstein-Uhlenbeck process, demonstrating superior performance in counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth. The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.
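
One simple way to enforce a DAG in a decoder, assumed here purely for illustration, is to mask a linear decoder's weights with the adjacency matrix so each variable is generated only from itself and its causal parents.

```python
import numpy as np

def dag_masked_decode(z, W, adjacency):
    """Linear decoder whose weight matrix is masked by a DAG adjacency
    (plus self-loops), so variable j receives signal only from its causal
    parents in the latent code. A toy version of enforcing causal
    constraints through DAGs in a decoder; names are illustrative."""
    mask = adjacency + np.eye(adjacency.shape[0])
    return z @ (W * mask)
```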

[304] Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training

Ipsita Ghosh, Ethan Nguyen, Christian Kümmerle

Main category: cs.LG

TL;DR: Q3R is a novel low-rank training method that uses quadratic reweighted rank regularization to enable efficient pre-training and fine-tuning of models with prescribed low ranks, achieving comparable performance to dense models with significant parameter reduction.

DetailsMotivation: Existing parameter-efficient training methods based on low-rank optimization fail at low-rank pre-training tasks where maintaining low-rank structure while achieving good performance remains challenging.

Method: Proposes Quadratic Reweighted Rank Regularizer (Q3R) inspired by iteratively reweighted least squares framework, using a quadratic regularizer term that majorizes a smoothed log determinant serving as rank surrogate objective.

Result: Achieved 60% and 80% parameter reduction in ViT-Tiny with only ~1.3% and ~4% accuracy drop on CIFAR-10 respectively. Demonstrated efficacy across Transformers for both image and language tasks, including low-rank fine-tuning.

Conclusion: Q3R enables training weight matrices with prescribed low target ranks while achieving comparable predictive performance to dense models with small computational overhead, remaining fully compatible with existing architectures.

Abstract: Parameter-efficient training, based on low-rank optimization, has become a highly successful tool for fine-tuning large deep-learning models. However, these methods fail at low-rank pre-training tasks, where maintaining both the low-rank structure and the training objective remains challenging. We propose the Quadratic Reweighted Rank Regularizer, dubbed Q3R, which leads to a novel low-rank inducing training strategy inspired by the iteratively reweighted least squares (IRLS) framework. Q3R is based on a quadratic regularizer term which majorizes a smoothed log determinant serving as rank surrogate objective. Unlike other low-rank training techniques, Q3R is able to train weight matrices with prescribed, low target ranks of models that achieve comparable predictive performance as dense models, with small computational overhead, while remaining fully compatible with existing architectures. For example, we demonstrate one experiment in which we truncate 60% and 80% of the parameters of a ViT-Tiny model with only ~1.3% and ~4% accuracy drops on CIFAR-10, respectively. The efficacy of Q3R is confirmed on Transformers across both image and language tasks, including for low-rank fine-tuning.
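
The IRLS-style majorization at the heart of Q3R can be written down directly: the smoothed log-determinant rank surrogate is concave in W W^T, so its linearization at an anchor W_k yields a quadratic upper bound tr(W^T P_k W) + const. This is a sketch of the regularizer only, not the full training loop, and function names are ours.

```python
import numpy as np

def logdet_surrogate(W, eps=1e-3):
    """Smoothed log-determinant rank surrogate: log det(W W^T + eps*I)."""
    m = W.shape[0]
    return np.linalg.slogdet(W @ W.T + eps * np.eye(m))[1]

def q3r_majorizer(W, W_k, eps=1e-3):
    """Quadratic reweighted regularizer at anchor W_k:
    tr(W^T P_k W) + const with P_k = (W_k W_k^T + eps*I)^{-1}.
    Concavity of log det on PSD matrices makes this an upper bound on
    the surrogate, tight at W = W_k (IRLS-style majorization)."""
    m = W_k.shape[0]
    P_k = np.linalg.inv(W_k @ W_k.T + eps * np.eye(m))
    const = logdet_surrogate(W_k, eps) - np.trace(P_k @ (W_k @ W_k.T))
    return np.trace(W.T @ P_k @ W) + const
```

Minimizing the quadratic term (plus the task loss) and re-anchoring P_k each round is what drives singular values toward zero.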

[305] Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks

Alper Kalle, Theo Rudkiewicz, Mohamed-Oumar Ouerfelli, Mohamed Tamaazousti

Main category: cs.LG

TL;DR: Proposes data-informed tensor compression for neural networks using covariance-based norms that minimize function space error instead of weight-space error, achieving competitive accuracy without fine-tuning.

DetailsMotivation: Neural networks require significant computing power, and compression methods typically use isotropic norms in weight space. The authors aim to improve compression by using data-informed norms that measure error in function space rather than weight space.

Method: Use covariance-based norms that minimize change in layer’s output distribution, expressed as ∥(W−W̃)Σ¹ᐟ²∥F where Σ¹ᐟ² is square root of input covariance matrix. Propose alternating least square algorithms for Tucker-2 and CPD tensor decompositions that directly optimize this norm.

Result: Achieves competitive accuracy without post-compression fine-tuning. The covariance-based norm can be transferred between datasets with minor accuracy drop, enabling compression when original training data is unavailable. Validated on ResNet-18/50, GoogLeNet architectures and ImageNet, FGVC-Aircraft, Cifar10/100 datasets.

Conclusion: Data-informed compression using function space norms outperforms conventional weight-space approaches, providing effective compression without requiring fine-tuning and working across different datasets.

Abstract: Neural networks are widely used for image-related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory- and compute-footprint can be reduced by compression. In this work, we focus on compression through tensorization and low-rank representations. Whereas classical approaches search for a low-rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight-space, we use data-informed norms that measure the error in function space. Concretely, we minimize the change in the layer’s output distribution, which can be expressed as $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$ where $\Sigma^{1/2}$ is the square root of the covariance matrix of the layer’s input and $W$, $\widetilde{W}$ are the original and compressed weights. We propose new alternating least square algorithms for the two most common tensor decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post-compression fine-tuning, our data-informed approach often achieves competitive accuracy without any fine-tuning. We further show that the same covariance-based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50, and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, Cifar10, and Cifar100) confirm the advantages of the proposed method.
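
For a single linear layer, the minimizer of the data-informed norm above has a closed form: whiten with the covariance square root, truncate the SVD, and unwhiten (Eckart-Young in the whitened coordinates). The tensor Tucker-2/CPD cases in the paper require the proposed ALS algorithms instead; this sketch covers only the matrix case.

```python
import numpy as np

def data_informed_lowrank(W, Sigma, rank):
    """Rank-r approximation minimizing ||(W - W_tilde) Sigma^{1/2}||_F
    for an SPD input covariance Sigma: truncate the SVD of the whitened
    matrix W Sigma^{1/2}, then map back."""
    vals, vecs = np.linalg.eigh(Sigma)           # symmetric square root
    S_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    S_half_inv = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    U, s, Vt = np.linalg.svd(W @ S_half, full_matrices=False)
    A_r = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]
    return A_r @ S_half_inv
```

By optimality in the whitened coordinates, this never does worse than plain truncated SVD under the weighted norm.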

[306] Alternative Fairness and Accuracy Optimization in Criminal Justice

Shaolong Wu, James Blume, Geshi Yeung

Main category: cs.LG

TL;DR: The paper proposes a modified group fairness approach for criminal justice that minimizes weighted error loss while maintaining small differences in false negative rates, addressing conflicts between fairness types and providing a practical deployment framework.

DetailsMotivation: Algorithmic fairness concepts remain unsettled in criminal justice, with conflicts between group, individual, and process fairness approaches that need resolution.

Method: Develop a modified group fairness approach that minimizes weighted error loss while constraining false negative rate differences within tolerance, and create a practical deployment framework with three pillars: need-based decisions, transparency/accountability, and narrowly tailored solutions.

Result: The proposed method makes solutions easier to find, can improve predictive accuracy, and explicitly surfaces ethical choices about error costs while addressing common critiques like biased data and subgroup constraints.

Conclusion: The approach links technical design to legitimacy and provides actionable guidance for agencies using risk assessment tools, offering a practical framework for deploying algorithmic fairness in public decision systems.

Abstract: Algorithmic fairness has grown rapidly as a research area, yet key concepts remain unsettled, especially in criminal justice. We review group, individual, and process fairness and map the conditions under which they conflict. We then develop a simple modification to standard group fairness. Rather than exact parity across protected groups, we minimize a weighted error loss while keeping differences in false negative rates within a small tolerance. This makes solutions easier to find, can raise predictive accuracy, and surfaces the ethical choice of error costs. We situate this proposal within three classes of critique: biased and incomplete data, latent affirmative action, and the explosion of subgroup constraints. Finally, we offer a practical framework for deployment in public decision systems built on three pillars: need-based decisions, transparency and accountability, and narrowly tailored definitions and solutions. Together, these elements link technical design to legitimacy and provide actionable guidance for agencies that use risk assessment and related tools.
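
The relaxed criterion, minimizing a weighted error subject to a small tolerance on the false-negative-rate gap, can be prototyped with a brute-force threshold search. The cost weights and the coarse grid are illustrative assumptions, not the authors' solver.

```python
import numpy as np

def fnr(scores, y, thr):
    """False negative rate of the rule 'predict positive iff score >= thr'."""
    pos = y == 1
    return float(np.mean(scores[pos] < thr)) if pos.any() else 0.0

def weighted_error(scores, y, thr, c_fn=2.0, c_fp=1.0):
    """Weighted loss; c_fn/c_fp encode the explicit choice of error costs."""
    fn = np.mean((y == 1) & (scores < thr))
    fp = np.mean((y == 0) & (scores >= thr))
    return c_fn * fn + c_fp * fp

def fair_thresholds(scores, y, group, tol=0.05, grid=None):
    """Per-group thresholds minimizing the weighted error while keeping the
    FNR gap within `tol` (grid search over two groups for illustration)."""
    grid = grid if grid is not None else np.linspace(0, 1, 21)
    best, best_loss = None, np.inf
    g0, g1 = group == 0, group == 1
    for t0 in grid:
        for t1 in grid:
            if abs(fnr(scores[g0], y[g0], t0) - fnr(scores[g1], y[g1], t1)) > tol:
                continue  # violates the relaxed parity constraint
            loss = (weighted_error(scores[g0], y[g0], t0)
                    + weighted_error(scores[g1], y[g1], t1))
            if loss < best_loss:
                best, best_loss = (t0, t1), loss
    return best, best_loss
```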

[307] Linear Mode Connectivity under Data Shifts for Deep Ensembles of Image Classifiers

C. Hepburn, T. Zielke, A. P. Raulf

Main category: cs.LG

TL;DR: This paper studies linear mode connectivity (LMC) under data shifts, finding that data shifts act as gradient noise that can be mitigated with small learning rates and large batch sizes. LMC helps balance training efficiency against ensemble diversity benefits.

DetailsMotivation: To understand how linear mode connectivity relates to deep learning aspects like training stability, model generalization, and how data shifts affect these relationships.

Method: Experimental study of LMC under data shifts, interpreting data shifts as gradient noise and testing mitigation through learning rate and batch size parameters.

Result: The gradient-noise effect of data shifts can be reduced through small learning rates and large batch sizes. LMC-connected models tend to make similar errors, but LMC offers a tradeoff between training efficiency and ensemble diversity.

Conclusion: LMC provides a framework for balancing training efficiency against ensemble diversity, with data shifts manageable through proper training parameters.

Abstract: The phenomenon of linear mode connectivity (LMC) links several aspects of deep learning, including training stability under noisy stochastic gradients, the smoothness and generalization of local minima (basins), the similarity and functional diversity of sampled models, and architectural effects on data processing. In this work, we experimentally study LMC under data shifts and identify conditions that mitigate their impact. We interpret data shifts as an additional source of stochastic gradient noise, which can be reduced through small learning rates and large batch sizes. These parameters influence whether models converge to the same local minimum or to regions of the loss landscape with varying smoothness and generalization. Although models sampled via LMC tend to make similar errors more frequently than those converging to different basins, the benefit of LMC lies in balancing training efficiency against the gains achieved from larger, more diverse ensembles. Code and supplementary materials will be made publicly available at https://github.com/DLR-KI/LMC in due course.
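
The standard operational test for LMC is the loss barrier along the linear path between two weight vectors: a near-zero barrier means the two minima lie in the same linearly connected basin. This is a generic check, shown here on toy inputs rather than trained networks.

```python
import numpy as np

def loss_barrier(w_a, w_b, loss_fn, n_points=21):
    """Evaluate the loss along the path (1-t)*w_a + t*w_b and return the
    excess of the path maximum over the average of the endpoint losses,
    together with the full loss profile."""
    ts = np.linspace(0.0, 1.0, n_points)
    path = [loss_fn((1 - t) * w_a + t * w_b) for t in ts]
    return max(path) - 0.5 * (path[0] + path[-1]), path
```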

[308] Comparing EPGP Surrogates and Finite Elements Under Degree-of-Freedom Parity

Obed Amo, Samit Ghosh, Markus Lange-Hegermann, Bogdan Raiţă, Michael Pokojovy

Main category: cs.LG

TL;DR: B-EPGP surrogate outperforms CN-FEM for 2D wave equation with homogeneous Dirichlet BCs, achieving ~2 orders of magnitude higher accuracy under matched degrees of freedom.

DetailsMotivation: To compare boundary-constrained Ehrenpreis-Palamodov Gaussian Process (B-EPGP) surrogate with classical finite element method (CN-FEM) for solving 2D wave equation with homogeneous Dirichlet boundary conditions.

Method: B-EPGP uses exponential-polynomial bases from characteristic variety to exactly enforce PDE and boundary conditions, with penalized least squares for coefficient estimation. Introduced DoF matching protocol for fair comparison.

Result: Under matched DoF, B-EPGP consistently achieves lower space-time L²-error and maximum-in-time L²-error in space than CN-FEM, improving accuracy by roughly two orders of magnitude.

Conclusion: B-EPGP surrogate significantly outperforms classical CN-FEM method for 2D wave equation problems with homogeneous Dirichlet boundary conditions.

Abstract: We present a new benchmarking study comparing a boundary-constrained Ehrenpreis–Palamodov Gaussian Process (B-EPGP) surrogate with a classical finite element method combined with Crank–Nicolson time stepping (CN-FEM) for solving the two-dimensional wave equation with homogeneous Dirichlet boundary conditions. The B-EPGP construction leverages exponential-polynomial bases derived from the characteristic variety to enforce the PDE and boundary conditions exactly and employs penalized least squares to estimate the coefficients. To ensure fairness across paradigms, we introduce a degrees-of-freedom (DoF) matching protocol. Under matched DoF, B-EPGP consistently attains lower space-time $L^2$-error and maximum-in-time $L^{2}$-error in space than CN-FEM, improving accuracy by roughly two orders of magnitude.
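
The coefficient-estimation step, penalized least squares over a fixed basis, is ordinary ridge regression once the basis matrix is assembled. Here a generic basis matrix `Phi` stands in for the exponential-polynomial B-EPGP features, which are an assumption of the sketch.

```python
import numpy as np

def penalized_ls_fit(Phi, y, lam=1e-6):
    """Ridge-penalized least squares for surrogate coefficients:
    argmin_c ||Phi c - y||^2 + lam ||c||^2, solved via normal equations."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)
```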

[309] End-to-End Reinforcement Learning of Koopman Models for eNMPC of an Air Separation Unit

Daniel Mayfrank, Kayra Dernek, Laura Lang, Alexander Mitsos, Manuel Dahmen

Main category: cs.LG

TL;DR: The paper demonstrates that a reinforcement learning-based method for training Koopman surrogate models scales effectively to a large-scale air separation unit for economic nonlinear model predictive control, achieving similar economic performance while avoiding constraint violations compared to system identification-based approaches.

DetailsMotivation: To extend the previously proposed reinforcement learning method for Koopman surrogate models from small-scale case studies to more challenging large-scale industrial applications, specifically demand response in air separation units with limited observable variables.

Method: Uses reinforcement learning to train Koopman surrogate models specifically optimized for economic nonlinear model predictive control applications, applied to a large-scale single-product (nitrogen) air separation unit with only a few realistically measurable plant variables.

Result: The method scales well to the large-scale case study and delivers similar economic performance to system identification-based Koopman eNMPC, but successfully avoids constraint violations that frequently occurred with the purely system identification-based approach.

Conclusion: The reinforcement learning-based approach for training Koopman surrogate models is effective for large-scale industrial applications, providing robust constraint satisfaction while maintaining economic performance in economic nonlinear model predictive control.

Abstract: With our recently proposed method based on reinforcement learning (Mayfrank et al. (2024), Comput. Chem. Eng. 190), Koopman surrogate models can be trained for optimal performance in specific (economic) nonlinear model predictive control ((e)NMPC) applications. So far, our method has exclusively been demonstrated on a small-scale case study. Herein, we show that our method scales well to a more challenging demand response case study built on a large-scale model of a single-product (nitrogen) air separation unit. Across all numerical experiments, we assume observability of only a few realistically measurable plant variables. Compared to a purely system identification-based Koopman eNMPC, which generates small economic savings but frequently violates constraints, our method delivers similar economic performance while avoiding constraint violations.

[310] Uncertainty Quantification for Reduced-Order Surrogate Models Applied to Cloud Microphysics

Jonas E. Katona, Emily K. de Jong, Nipun Gunawardena

Main category: cs.LG

TL;DR: A model-agnostic framework for uncertainty quantification in reduced-order models using conformal prediction to generate prediction intervals for latent dynamics, reconstruction, and end-to-end predictions.

DetailsMotivation: Existing uncertainty quantification methods for ROMs are architecture- or training-specific, limiting flexibility and generalization.

Method: Post hoc framework using conformal prediction that requires no modification to the underlying ROM architecture or training procedure.

Result: Successfully demonstrated on a latent space dynamical model for cloud microphysics, accurately predicting droplet-size distribution evolution and quantifying uncertainty across the ROM pipeline.

Conclusion: The proposed method provides robust, model-agnostic uncertainty quantification for reduced-order models without requiring architectural changes or specialized training.

Abstract: Reduced-order models (ROMs) can efficiently simulate high-dimensional physical systems, but lack robust uncertainty quantification methods. Existing approaches are frequently architecture- or training-specific, which limits flexibility and generalization. We introduce a post hoc, model-agnostic framework for predictive uncertainty quantification in latent space ROMs that requires no modification to the underlying architecture or training procedure. Using conformal prediction, our approach estimates statistical prediction intervals for multiple components of the ROM pipeline: latent dynamics, reconstruction, and end-to-end predictions. We demonstrate the method on a latent space dynamical model for cloud microphysics, where it accurately predicts the evolution of droplet-size distributions and quantifies uncertainty across the ROM pipeline.
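
Split conformal prediction, the engine of the framework, needs only a calibration set of residuals from any frozen predictor. The sketch below is generic, not specific to the cloud-microphysics ROM or any particular pipeline component.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal prediction around any point predictor: take absolute
    residuals on a held-out calibration set, pick the finite-sample
    corrected quantile q, and emit [pred - q, pred + q] for new points.
    Post hoc and model-agnostic, as in the framework described above."""
    resid = np.abs(cal_true - cal_pred)
    n = len(resid)
    # Rank ceil((n+1)(1-alpha)) gives (1 - alpha) marginal coverage.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(resid)[min(k, n) - 1]
    return test_pred - q, test_pred + q
```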

[311] Integrating Temporal and Structural Context in Graph Transformers for Relational Deep Learning

Divyansha Lachi, Mahmoud Mohammadi, Joe Meyer, Vinam Arora, Tom Palczewski, Eva L. Dyer

Main category: cs.LG

TL;DR: RGP is a graph transformer that integrates spatial and temporal dependencies in relational data using cross-attention and latent bottlenecks, supporting multi-task learning across domains like healthcare and finance.

DetailsMotivation: Existing graph models focus mainly on spatial structure, treat temporal information as filtering constraints rather than modeling signals, and are designed for single-task prediction, limiting their utility in dynamic relational systems.

Method: Proposes temporal subgraph sampler for global context and Relational Graph Perceiver (RGP) with cross-attention-based latent bottleneck to integrate structural and temporal contexts, plus flexible decoder for multi-task learning.

Result: RGP achieves state-of-the-art performance on RelBench, SALT, and CTU benchmarks, demonstrating effectiveness in capturing complex temporal dynamics and supporting diverse predictive tasks.

Conclusion: RGP provides a general and scalable solution for relational deep learning that effectively handles both spatial and temporal dependencies while supporting multiple predictive tasks within a single model.

Abstract: In domains such as healthcare, finance, and e-commerce, the temporal dynamics of relational data emerge from complex interactions, such as those between patients and providers, or between users and products across diverse categories. To be broadly useful, models operating on these data must integrate long-range spatial and temporal dependencies across diverse types of entities, while also supporting multiple predictive tasks. However, existing graph models for relational data primarily focus on spatial structure, treating temporal information merely as a filtering constraint to exclude future events rather than a modeling signal, and are typically designed for single-task prediction. To address these gaps, we introduce a temporal subgraph sampler that enhances global context by retrieving nodes beyond the immediate neighborhood to capture temporally relevant relationships. In addition, we propose the Relational Graph Perceiver (RGP), a graph transformer architecture for relational deep learning that leverages a cross-attention-based latent bottleneck to efficiently integrate information from both structural and temporal contexts. This latent bottleneck integrates signals from different node and edge types into a common latent space, enabling the model to build global context across the entire relational system. RGP also incorporates a flexible cross-attention decoder that supports joint learning across tasks with disjoint label spaces within a single model. Experiments on RelBench, SALT, and CTU show that RGP delivers state-of-the-art performance, offering a general and scalable solution for relational deep learning with support for diverse predictive tasks.


[312] ARETE: an R package for Automated REtrieval from TExt with large language models

Vasco V. Branco, Jandó Benedek, Lidia Pivovarova, Luís Correia, Pedro Cardoso

Main category: cs.LG

TL;DR: ARETE is an R package that uses large language models to automate extraction of species occurrence data from scientific literature, significantly expanding known species ranges and enabling more efficient conservation planning.

DetailsMotivation: Lack of machine-readable species occurrence data and the slow manual extraction process hinder conservation efforts, especially given the urgent need for rapid data collection due to anthropogenic pressures.

Method: Developed ARETE R package using chatGPT API for automated data extraction, integrating OCR, outlier detection, and tabular output, with validation through comparison with human annotators.

Result: Testing on 100 spider species showed ARETE expanded known Extent of Occurrence by three orders of magnitude, revealing previously undocumented areas where species were historically found.

Conclusion: ARETE enables faster access to untapped occurrence data, allowing researchers to prioritize manual verification while automating extraction for most species, potentially revolutionizing conservation data workflows.

Abstract: 1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Furthermore, researchers have to contend with an accelerated speed at which new information must be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information but their data are often not machine-readable, requiring extensive human work to be retrieved. 2. We present the ARETE R package, open-source software aiming to automate data extraction of species occurrences powered by large language models, namely using the chatGPT Application Programming Interface. This R package integrates all steps of the data extraction and validation process, from Optical Character Recognition to detection of outliers and output in tabular format. Furthermore, we validate ARETE through systematic comparison between what is modelled and the work of human annotators. 3. We demonstrate the usefulness of the approach by comparing range maps produced using GBIF data with those automatically extracted for 100 species of spiders. Newly extracted data allowed us to expand the known Extent of Occurrence by three orders of magnitude on average, revealing new areas where the species were found in the past, which may have important implications for spatial conservation planning and extinction risk assessments. 4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer in projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also allows predicting available bibliographic data during project planning.
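Extent of Occurrence is conventionally the area of a convex hull drawn around a species' occurrence points. A minimal planar sketch follows; real assessments project coordinates and use geodesic areas, which this toy version ignores:

```python
def convex_hull(points):
    """Andrew's monotone chain convex hull for 2-D points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(poly):
    """Shoelace formula; the hull's area is a planar EOO proxy."""
    n = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % n][1]
            - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))
    return abs(s) / 2
```

Adding newly extracted occurrence records to the point set and recomputing the hull is how an EOO expansion like the one reported here would be quantified.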

[313] Complexity as Advantage: A Regret-Based Perspective on Emergent Structure

Oshri Naparstek

Main category: cs.LG

TL;DR: CAA framework defines system complexity relative to observers based on predictive regret, showing complexity arises when systems are easy for some observers but hard for others, creating information advantage.

DetailsMotivation: To provide a unified framework for understanding complexity not as intrinsic property but as observer-dependent phenomenon that creates functional value through information advantages.

Method: Define complexity through predictive regret across different observers, demonstrate with simple dynamical models, and analyze implications for learning, evolution, and AI agents.

Result: Shows that complexity emerges when systems induce differentiated predictive regret across observers, unifying concepts like multiscale entropy and predictive information.

Conclusion: Interesting systems are those that create differentiated regret across observers, providing quantitative basis for why complexity can be functionally valuable in learning and evolution.

Abstract: We introduce Complexity as Advantage (CAA), a framework that defines the complexity of a system relative to a family of observers. Instead of measuring complexity as an intrinsic property, we evaluate how much predictive regret a system induces for different observers attempting to model it. A system is complex when it is easy for some observers and hard for others, creating an information advantage. We show that this formulation unifies several notions of emergent behavior, including multiscale entropy, predictive information, and observer-dependent structure. The framework suggests that “interesting” systems are those positioned to create differentiated regret across observers, providing a quantitative grounding for why complexity can be functionally valuable. We demonstrate the idea through simple dynamical models and discuss implications for learning, evolution, and artificial agents.

[314] Addressing divergent representations from causal interventions on neural networks

Satchel Grant, Simon Jerome Han, Alexa Tartaglini, Christopher Potts

Main category: cs.LG

TL;DR: Causal interventions in mechanistic interpretability often create out-of-distribution representations, which can lead to unfaithful explanations. The paper identifies harmless vs pernicious divergences and proposes a modified regularization method to mitigate harmful effects.

DetailsMotivation: To investigate whether causal intervention techniques in mechanistic interpretability create out-of-distribution representations that may compromise the faithfulness of explanations to the model's natural state.

Method: Empirical demonstration of distribution shifts from interventions, theoretical analysis of divergence types (harmless vs pernicious), and modification of Counterfactual Latent loss for regularization to keep interventions closer to natural distributions.

Result: Common intervention techniques do shift representations away from natural distributions. The modified CL loss successfully reduces harmful divergences while preserving interpretive power.

Conclusion: The findings highlight the need for more reliable interpretability methods and provide a path forward through regularization techniques that maintain intervention effectiveness while reducing distributional divergence.

Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two classes of such divergences: 'harmless' divergences that occur in the null-space of the weights and from covariance within behavioral decision boundaries, and 'pernicious' divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we modify the Counterfactual Latent (CL) loss from Grant (2025) that regularizes interventions to remain closer to the natural distributions, reducing the likelihood of harmful divergences while preserving the interpretive power of interventions. Together, these results highlight a path towards more reliable interpretability methods.
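One simple way to flag the kind of distribution shift described here is to score a patched activation against the natural activation statistics. The diagonal z-score heuristic below is our illustration, not the paper's method:

```python
import statistics

def divergence_score(natural_acts, patched):
    """Score a patched activation vector against the empirical mean and
    per-dimension std of natural activations (a diagonal Mahalanobis-style
    distance). Large values suggest an out-of-distribution intervention."""
    dims = list(zip(*natural_acts))       # transpose: one tuple per dimension
    zs = []
    for d, x in zip(dims, patched):
        mu = statistics.fmean(d)
        sd = statistics.pstdev(d) or 1.0  # guard against zero variance
        zs.append((x - mu) / sd)
    return sum(z * z for z in zs) ** 0.5
```

In practice one would collect `natural_acts` by running the unmodified model over a reference corpus, then score each intervened representation before trusting the resulting explanation.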

[315] Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

Hampus Åström, Elin Anna Topp, Jacek Malec

Main category: cs.LG

TL;DR: Transforming RL environments into goal-conditioned setups enables autonomous, reward-free learning where agents select their own goals with training times comparable to guided RL.

DetailsMotivation: To enable agents to learn tasks autonomously without external rewards by selecting their own goals in an environment-agnostic manner.

Method: Transform regular RL environments into goal-conditioned environments, allowing agents to autonomously select goals independent of the underlying off-policy learning algorithm.

Result: Average goal success rate improves and stabilizes, though individual goal performance may fluctuate. Agents can be instructed to seek any environment observations.

Conclusion: Goal-conditioned environment transformation enables generic agent training prior to specific use cases, achieving autonomous learning comparable to guided RL.

Abstract: In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.
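The environment transformation can be sketched as a wrapper that discards the external reward, augments observations with a self-selected goal, and rewards goal-reaching instead. The toy chain environment and uniform goal sampling below are our stand-ins for whatever sampler the paper actually uses:

```python
import random

class ChainEnv:
    """Toy 1-D chain: the state is an integer, actions move it by -1/+1."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s += action
        return self.s, 0.0, False, {}

class GoalConditionedWrapper:
    """Reward-free wrapper: the agent picks its own goals from past observations."""
    def __init__(self, env, goal_tol=0):
        self.env = env
        self.seen = []        # pool of previously observed states
        self.goal = None
        self.goal_tol = goal_tol

    def reset(self):
        obs = self.env.reset()
        self.seen.append(obs)
        # self-select a goal uniformly from everything observed so far
        self.goal = random.choice(self.seen)
        return (obs, self.goal)

    def step(self, action):
        obs, _, done, info = self.env.step(action)   # external reward ignored
        self.seen.append(obs)
        reached = abs(obs - self.goal) <= self.goal_tol
        reward = 1.0 if reached else 0.0             # intrinsic goal-reaching reward
        return (obs, self.goal), reward, done or reached, info
```

Because the wrapper only touches observations and rewards, any off-policy learner can be trained on it unchanged, matching the paper's claim of algorithm independence.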

[316] Efficient probabilistic surrogate modeling techniques for partially-observed large-scale dynamical systems

Hans Harder, Abhijeet Vishwasrao, Luca Guastoni, Ricardo Vinuesa, Sebastian Peitz

Main category: cs.LG

TL;DR: This paper compares various flow matching extensions for faster sampling in PDE-based dynamical system forecasting, including direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs, and rectified flows.

DetailsMotivation: To develop probabilistic forecasting techniques for dynamical systems described by partial differential equations (like Navier-Stokes) with reduced sampling steps, enabling efficient inflow generation for solvers.

Method: Investigates and compares multiple flow matching extensions: direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs, and rectified flows. Experiments conducted on challenging systems including direct prediction of 2D slices from large-scale 3D simulations.

Result: The paper provides comparative analysis of different flow matching techniques for reducing sampling steps in PDE-based dynamical system forecasting.

Conclusion: Various flow matching extensions offer promising approaches for efficient probabilistic forecasting of dynamical systems, with potential applications in generating inflow conditions for numerical solvers.

Abstract: This paper is concerned with probabilistic techniques for forecasting dynamical systems described by partial differential equations (such as, for example, the Navier-Stokes equations). In particular, it investigates and compares various extensions to the flow matching paradigm that reduce the number of sampling steps. In this regard, it compares direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs and rectified flows. Moreover, experiments are conducted on a set of challenging systems. In particular, we also address the challenge of directly predicting 2D slices of large-scale 3D simulations, paving the way for efficient inflow generation for solvers.

[317] Optimal Inference Schedules for Masked Diffusion Models

Sitan Chen, Kevin Cong, Jerry Li

Main category: cs.LG

TL;DR: This paper provides a rigorous analysis of parallel sampling capabilities in masked diffusion language models, establishing exact bounds on sampling divergence and connecting it to function approximation theory.

DetailsMotivation: Standard auto-regressive LLMs have sequential inference leading to long, costly inference times. Diffusion language models like MDM promise parallel sampling, but there's limited understanding of how much parallelism is possible without performance degradation.

Method: The authors develop a new exact characterization of expected divergence between true and sampled distributions, connecting it to univariate function approximation theory. They derive novel lower and upper bounds using information-theoretic properties like total correlation and dual total correlation.

Result: The analysis shows that optimal unmasking schedules depend heavily on distribution knowledge, but in natural settings, one can sample in O(log n) steps without visible performance loss, where n is sequence length.

Conclusion: While competing with optimal unmasking schedules requires strong distribution knowledge, practical parallel sampling with logarithmic steps is achievable in many natural scenarios using information-theoretic properties of the distribution.

Abstract: A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called diffusion language models, of which the masked diffusion model (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work of Li and Cai obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, exact characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of univariate function approximation. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong a priori knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its total correlation and dual total correlation, which show that in some natural settings, one can sample in $O(\log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.
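The flavor of an O(log n)-step schedule is easy to see with a halving rule: unmask half of the remaining masked positions at each step. This particular rule is our illustrative choice; the paper derives its schedules from information-theoretic quantities of the base distribution:

```python
import math

def geometric_schedule(n):
    """Return the number of tokens unmasked at each step when half of the
    remaining masked positions are revealed per step: ~log2(n) steps total."""
    remaining, steps = n, []
    while remaining > 0:
        k = max(1, remaining // 2)   # always unmask at least one token
        steps.append(k)
        remaining -= k
    return steps
```

For a length-1024 sequence this schedule uses 11 steps (log2(1024) + 1), versus 1024 steps for strictly sequential auto-regressive decoding.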

[318] TT-Prune: Joint Model Pruning and Resource Allocation for Communication-efficient Time-triggered Federated Learning

Xinlu Zhang, Yansha Deng, Toktam Mahmoodi

Main category: cs.LG

TL;DR: The paper introduces adaptive model pruning to wireless time-triggered federated learning (TT-Fed) systems to address communication bottlenecks and stragglers by jointly optimizing pruning ratio and bandwidth allocation to minimize training loss under latency constraints.

DetailsMotivation: Federated learning faces challenges with growing user devices having limited wireless bandwidth, leading to stragglers and high communication overhead. TT-Fed clusters users by time intervals but still suffers from these wireless communication bottlenecks.

Method: Proposed adaptive model pruning in wireless TT-Fed systems, performed convergence analysis on gradient norms, formulated joint optimization problem for pruning ratio and bandwidth allocation, and derived closed-form solutions using KKT conditions.

Result: Simulation results show that model pruning reduces communication cost by 40% while maintaining model performance at the same level.

Conclusion: Adaptive model pruning effectively addresses communication bottlenecks in wireless federated learning systems, achieving significant communication cost reduction without compromising model performance.

Abstract: Federated learning (FL) offers new opportunities in machine learning, particularly in addressing data privacy concerns. In contrast to conventional event-based federated learning, time-triggered federated learning (TT-Fed), as a general form of both asynchronous and synchronous FL, clusters users into different tiers based on fixed time intervals. However, the FL network consists of a growing number of user devices with limited wireless bandwidth, consequently magnifying issues such as stragglers and communication overhead. In this paper, we introduce adaptive model pruning to wireless TT-Fed systems and study the problem of jointly optimizing the pruning ratio and bandwidth allocation to minimize the training loss while ensuring minimal learning latency. To address this problem, we perform a convergence analysis on the gradient $\ell_2$ norm of the TT-Fed model based on model pruning. Based on the obtained convergence upper bound, a joint optimization problem of pruning ratio and wireless bandwidth is formulated to minimize the model training loss under a given delay threshold. Then, we derive closed-form solutions for wireless bandwidth and pruning ratio using Karush-Kuhn-Tucker (KKT) conditions. The simulation results show that model pruning could reduce the communication cost by 40% while maintaining the model performance at the same level.
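The paper's joint problem couples pruning ratios with delay constraints; the stripped-down surrogate below only shows the KKT mechanics on a simpler convex problem, minimize sum(c_i / b_i) subject to sum(b_i) = B, whose stationarity condition gives a closed form:

```python
import math

def kkt_bandwidth(costs, B):
    """Closed-form allocation for: minimize sum(c_i / b_i) s.t. sum(b_i) = B.
    Stationarity: -c_i / b_i**2 + lam = 0  =>  b_i = sqrt(c_i / lam);
    primal feasibility then yields b_i = B * sqrt(c_i) / sum_j sqrt(c_j)."""
    roots = [math.sqrt(c) for c in costs]
    total = sum(roots)
    return [B * r / total for r in roots]
```

Note how users with larger per-unit cost `c_i` (e.g. more data to upload) receive proportionally more bandwidth, the qualitative behavior one expects from the paper's allocation as well.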

[319] Nowcast3D: Reliable precipitation nowcasting via gray-box learning

Huaguan Chen, Wei Han, Haofei Sun, Ning Lin, Xingtao Song, Yunfan Yang, Jie Tian, Yang Liu, Ji-Rong Wen, Xiaoye Zhang, Xueshun Shen, Hao Sun

Main category: cs.LG

TL;DR: A 3D gray-box nowcasting framework that combines physical constraints with data-driven learning for extreme precipitation forecasting, achieving superior accuracy up to 3-hour lead times.

DetailsMotivation: Existing methods like NWP and 2D radar-based approaches are limited - NWP is too slow/coarse for rapidly evolving convection, while 2D methods discard crucial vertical information needed for accurate height-dependent dynamics reconstruction.

Method: Hybrid 3D framework that processes volumetric radar reflectivity, learns vertically varying 3D advection fields under conservative operators, parameterizes spatially varying diffusion, adds stochastic terms for unresolved motions, and includes residual branches for small-scale convective initiation.

Result: Achieves more accurate forecasts up to 3-hour lead time across precipitation regimes, ranked first in 57% of cases in blind evaluation by 160 meteorologists.

Conclusion: The framework offers a scalable and robust pathway for skillful and reliable extreme precipitation nowcasting by restoring full 3D dynamics with physical consistency.

Abstract: Extreme precipitation nowcasting demands high spatiotemporal fidelity and extended lead times, yet existing approaches remain limited. Numerical Weather Prediction (NWP) and its deep-learning emulations are too slow and coarse for rapidly evolving convection, while extrapolation and purely data-driven models suffer from error accumulation and excessive smoothing. Hybrid 2D radar-based methods discard crucial vertical information, preventing accurate reconstruction of height-dependent dynamics. We introduce a gray-box, fully three-dimensional nowcasting framework that directly processes volumetric radar reflectivity and couples physically constrained neural operators with data-driven learning. The model learns vertically varying 3D advection fields under a conservative advection operator, parameterizes spatially varying diffusion, and introduces a Brownian-motion-inspired stochastic term to represent unresolved motions. A residual branch captures small-scale convective initiation and microphysical variability, while a diffusion-based stochastic module estimates uncertainty. The framework achieves more accurate forecasts up to three-hour lead time across precipitation regimes and ranked first in 57% of cases in a blind evaluation by 160 meteorologists. By restoring full 3D dynamics with physical consistency, it offers a scalable and robust pathway for skillful and reliable nowcasting of extreme precipitation.
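The conservative advection operator at the heart of such gray-box models can be illustrated with the simplest flux-form upwind scheme in 1-D. The paper learns vertically varying 3-D advection fields; this fixed-velocity toy only shows why the flux form conserves mass discretely:

```python
def upwind_advect(q, u, dt, dx):
    """One finite-volume upwind step for dq/dt + u * dq/dx = 0 with u > 0
    and periodic boundaries. Updating each cell by the difference of face
    fluxes makes the total mass sum(q) exactly conserved."""
    n = len(q)
    flux = [u * q[i] for i in range(n)]   # upwind flux at the right face of cell i
    # cell update: q_i^{new} = q_i - (dt/dx) * (F_i - F_{i-1}); q[i-1] wraps at i=0
    return [q[i] - (dt / dx) * (flux[i] - flux[i - 1]) for i in range(n)]
```

Stability requires the CFL condition u * dt / dx <= 1; the learned 3-D operator in the paper plays the same role as this fixed stencil while letting the velocity field vary with height.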

[320] Forgetting is Everywhere

Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

Main category: cs.LG

TL;DR: The paper proposes a unified theory of forgetting in machine learning as a lack of self-consistency in predictive distributions, develops a general measure for forgetting propensity, and validates it across diverse learning settings.

DetailsMotivation: Addressing the fundamental challenge of catastrophic forgetting in general learning algorithms, where models lose past knowledge when adapting to new data, and the lack of a unified definition to understand forgetting dynamics.

Method: Developed an algorithm- and task-agnostic theory characterizing forgetting as loss of predictive information due to lack of self-consistency in predictive distributions, and designed comprehensive experiments across classification, regression, generative modeling, and reinforcement learning.

Result: Empirical validation shows forgetting is present across all learning settings and significantly impacts learning efficiency, with the proposed measure effectively capturing forgetting behavior.

Conclusion: Establishes a principled foundation for understanding forgetting and provides tools for analyzing and improving information retention in general learning algorithms.

Abstract: A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner’s predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm’s propensity to forget. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.
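The paper's measure is more general, but the core idea (loss of self-consistency in the predictive distribution over earlier experiences) can be sketched as an average KL divergence between pre- and post-update predictions on the same inputs. The discrete-distribution setup here is our simplification:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (tuples of probs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forgetting(pred_before, pred_after):
    """Forgetting on a batch of earlier inputs: mean KL between the model's
    predictive distributions before and after an update on new data.
    Zero iff the learner's predictions stayed self-consistent."""
    return sum(kl(p, q) for p, q in zip(pred_before, pred_after)) / len(pred_before)
```

A learner that never changes its predictions on old inputs scores zero; a confident prediction collapsing to uniform scores log 2 per binary decision, regardless of task or algorithm.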

[321] Multi-Method Analysis of Mathematics Placement Assessments: Classical, Machine Learning, and Clustering Approaches

Julian D. Allagan, Dasia A. Singleton, Shanae N. Perry, Gabrielle C. Morgan, Essence A. Morgan

Main category: cs.LG

TL;DR: A multi-method analysis of a 40-item math placement exam using Classical Test Theory, machine learning, and clustering revealed excellent discrimination for 55% of items, identified Question 6 as the strongest discriminator, and found machine learning models achieving 97.5% accuracy. Clustering suggested a natural competency boundary at 42.5%, diverging from the institutional 55% threshold.

DetailsMotivation: To evaluate and optimize mathematics placement examinations through a comprehensive multi-method framework that combines traditional psychometric analysis with modern machine learning and clustering techniques.

Method: Used a multi-method framework combining Classical Test Theory (item discrimination analysis), machine learning (Random Forest and Gradient Boosting), and unsupervised clustering (K-means) on a 40-item mathematics placement exam administered to 198 students.

Result: 55% of items showed excellent discrimination, 30% poor discrimination requiring replacement. Question 6 was the strongest discriminator. Machine learning achieved 97.5% accuracy. Clustering revealed a natural competency boundary at 42.5% vs institutional 55%, suggesting potential overclassification into remedial categories.

Conclusion: Multi-method integration provides robust empirical foundation for evidence-based placement optimization. Recommendations include replacing poorly discriminating items, implementing two-stage assessment, and integrating Random Forest predictions with transparency mechanisms.

Abstract: This study evaluates a 40-item mathematics placement examination administered to 198 students using a multi-method framework combining Classical Test Theory, machine learning, and unsupervised clustering. Classical Test Theory analysis reveals that 55% of items achieve excellent discrimination ($D \geq 0.40$) while 30% demonstrate poor discrimination ($D < 0.20$) requiring replacement. Question 6 (Graph Interpretation) emerges as the examination’s most powerful discriminator, achieving perfect discrimination ($D = 1.000$), highest ANOVA F-statistic ($F = 4609.1$), and maximum Random Forest feature importance (0.206), accounting for 20.6% of predictive power. Machine learning algorithms demonstrate exceptional performance, with Random Forest and Gradient Boosting achieving 97.5% and 96.0% cross-validation accuracy. K-means clustering identifies a natural binary competency structure with a boundary at 42.5%, diverging from the institutional threshold of 55% and suggesting potential overclassification into remedial categories. The two-cluster solution exhibits exceptional stability (bootstrap ARI = 0.855) with perfect lower-cluster purity. Convergent evidence across methods supports specific refinements: replace poorly discriminating items, implement a two-stage assessment, and integrate Random Forest predictions with transparency mechanisms. These findings demonstrate that multi-method integration provides a robust empirical foundation for evidence-based mathematics placement optimization.
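The upper-lower discrimination index $D$ used in the Classical Test Theory analysis is straightforward to compute. This sketch uses the common 27% grouping rule; the paper does not state its exact grouping choice:

```python
def discrimination_index(item_correct, total_scores, frac=0.27):
    """Classical upper-lower discrimination: D = p_upper - p_lower,
    where p_upper/p_lower are the item's proportion correct among the
    top and bottom `frac` of examinees by total score."""
    n = len(total_scores)
    k = max(1, int(round(frac * n)))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_low = sum(item_correct[i] for i in lower) / k
    p_up = sum(item_correct[i] for i in upper) / k
    return p_up - p_low
```

An item answered correctly only by high scorers reaches D = 1.000 (perfect discrimination, as reported for Question 6), while D < 0.20 marks the poorly discriminating items slated for replacement.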

[322] A Unified Kernel for Neural Network Learning

Shao-Qun Zhang, Zong-Yi Chen, Yong-Ming Tian, Xun Lu

Main category: cs.LG

TL;DR: The paper proposes Unified Neural Kernel (UNK) that connects NNGP and NTK approaches, showing NTK-like behavior with finite learning steps and convergence to NNGP as learning steps approach infinity.

DetailsMotivation: To bridge the gap between Neural Network Gaussian Process (NNGP) and Neural Tangent Kernel (NTK) approaches for understanding neural network learning dynamics, providing a unified framework.

Method: Developed UNK kernel induced by inner product of produced variables, characterizing learning dynamics of neural networks with gradient descent and parameter initialization.

Result: UNK maintains limiting properties of both NNGP and NTK, exhibits NTK-like behavior with finite learning steps, converges to NNGP as learning steps approach infinity, with theoretical guarantees for uniform tightness and learning convergence.

Conclusion: UNK provides a comprehensive unified kernel framework that effectively connects and generalizes both NNGP and NTK approaches, with experimental validation of its effectiveness.

Abstract: Past decades have witnessed a great interest in the distinction and connection between neural network learning and kernel learning. Recent advancements have made theoretical progress in connecting infinite-wide neural networks and Gaussian processes. Two predominant approaches have emerged: the Neural Network Gaussian Process (NNGP) and the Neural Tangent Kernel (NTK). The former, rooted in Bayesian inference, represents a zero-order kernel, while the latter, grounded in the tangent space of gradient descent, is a first-order kernel. In this paper, we present the Unified Neural Kernel (UNK), which is induced by the inner product of produced variables and characterizes the learning dynamics of neural networks under gradient descent and parameter initialization. The proposed UNK kernel maintains the limiting properties of both NNGP and NTK, exhibiting behaviors akin to NTK with a finite learning step and converging to NNGP as the learning step approaches infinity. Besides, we also theoretically characterize the uniform tightness and learning convergence of the UNK kernel, providing comprehensive insights into this unified kernel. Experimental results underscore the effectiveness of our proposed method.
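For intuition on the first-order kernel being unified here: the empirical NTK is just an inner product of parameter Jacobians. A finite-difference sketch on a toy linear model, where the kernel is exactly x1*x2 + 1, unrelated to the UNK construction itself:

```python
def empirical_ntk(f, params, x1, x2, eps=1e-5):
    """Empirical NTK(x1, x2) = <df/dtheta(x1), df/dtheta(x2)>, with the
    parameter Jacobians estimated by central finite differences."""
    def jac(x):
        g = []
        for j in range(len(params)):
            hi = list(params); hi[j] += eps
            lo = list(params); lo[j] -= eps
            g.append((f(hi, x) - f(lo, x)) / (2 * eps))
        return g
    return sum(a * b for a, b in zip(jac(x1), jac(x2)))

def linear_model(p, x):
    # f(x) = w*x + b, so the Jacobian is [x, 1] and NTK(x1, x2) = x1*x2 + 1
    return p[0] * x + p[1]
```

For wide nonlinear networks the same computation (with autodiff in place of finite differences) yields the kernel whose limiting behavior NNGP, NTK, and UNK each describe differently.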

[323] Stochastic Diffusion: A Diffusion Probabilistic Model for Stochastic Time Series Forecasting

Yuansan Liu, Sudanthi Wijewickrema, Dongting Hu, Christofer Bester, Stephen O’Leary, James Bailey

Main category: cs.LG

TL;DR: StochDiff: A novel diffusion model that learns data-driven priors using stochastic latent spaces to better model highly stochastic multivariate time series data.

DetailsMotivation: Leverage diffusion models' generative capabilities for time series forecasting, particularly addressing the challenge of modeling highly stochastic time series data where existing approaches struggle.

Method: Proposes Stochastic Diffusion (StochDiff) model that learns data-driven prior knowledge at each time step by utilizing stochastic latent spaces to capture the variability in multivariate time series data.

Result: Demonstrated effectiveness on real-world datasets for stochastic time series forecasting, with successful application in surgical guidance scenarios.

Conclusion: StochDiff improves modeling of highly stochastic time series by capturing complex temporal dynamics and inherent uncertainty, showing potential for medical applications.

Abstract: Recent innovations in diffusion probabilistic models have paved the way for significant progress in image, text and audio generation, leading to their applications in generative time series forecasting. However, leveraging such abilities to model highly stochastic time series data remains a challenge. In this paper, we propose a novel Stochastic Diffusion (StochDiff) model which learns data-driven prior knowledge at each time step by utilizing the representational power of the stochastic latent spaces to model the variability of the multivariate time series data. The learnt prior knowledge helps the model to capture complex temporal dynamics and the inherent uncertainty of the data. This improves its ability to model highly stochastic time series data. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of our proposed model on stochastic time series forecasting. Additionally, we showcase an application of our model for real-world surgical guidance, highlighting its potential to benefit the medical community.
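
For readers new to the underlying machinery: diffusion probabilistic models like StochDiff rest on a forward noising process whose marginal has a closed form. The sketch below shows only that generic forward process on a toy 1-D "time-series value" distribution; the schedule and dimensions are illustrative, and the learned reverse (denoising) process is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.2, T)          # noise schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)        # cumulative signal retention

x0 = rng.normal(3.0, 0.5, size=5000)       # toy data distribution

# Closed-form forward marginal: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
xT = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(5000)

# After T steps the samples are approximately standard Gaussian,
# which is the starting point for reverse-process sampling.
print(abs(xT.mean()) < 0.2, abs(xT.std() - 1.0) < 0.1)
```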

[324] Beyond the Kolmogorov Barrier: A Learnable Weighted Hybrid Autoencoder for Model Order Reduction

Nithin Somasekharan, Shaowu Pan

Main category: cs.LG

TL;DR: A learnable weighted hybrid autoencoder that combines SVD with deep autoencoders to overcome poor convergence in high-dimensional physical system representation learning.

DetailsMotivation: To address the poor convergence behavior of deep autoencoders as latent space rank increases, overcoming the Kolmogorov barrier for high-dimensional complex physical systems.

Method: Proposes a learnable weighted hybrid autoencoder that combines singular value decomposition (SVD) with deep autoencoders through learnable weighting parameters.

Result: Significantly improves generalization performance on chaotic PDE systems, achieves sharpness thousands of times smaller than other models, and enhances surrogate modeling when combined with time series techniques.

Conclusion: The learnable weighting framework is essential for successful hybrid modeling, enabling improved representation learning and surrogate modeling for high-dimensional multi-scale PDE systems.

Abstract: Representation learning for high-dimensional, complex physical systems aims to identify a low-dimensional intrinsic latent space, which is crucial for reduced-order modeling and modal analysis. To overcome the well-known Kolmogorov barrier, deep autoencoders (AEs) have been introduced in recent years, but they often suffer from poor convergence behavior as the rank of the latent space increases. To address this issue, we propose the learnable weighted hybrid autoencoder, a hybrid approach that combines the strengths of singular value decomposition (SVD) with deep autoencoders through a learnable weighted framework. We find that the introduction of learnable weighting parameters is essential – without them, the resulting model would either collapse into a standard POD or fail to exhibit the desired convergence behavior. Interestingly, we empirically find that our trained model has a sharpness thousands of times smaller compared to other models. Our experiments on classical chaotic PDE systems, including the 1D Kuramoto-Sivashinsky and forced isotropic turbulence datasets, demonstrate that our approach significantly improves generalization performance compared to several competing methods. Additionally, when combining with time series modeling techniques (e.g., Koopman operator, LSTM), the proposed technique offers significant improvements for surrogate modeling of high-dimensional multi-scale PDE systems.
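
The hybrid idea can be sketched in a few lines. Below, a rank-r POD/SVD reconstruction (the linear branch) is blended with a placeholder nonlinear branch through a weight `alpha`; in the paper this weight is learnable and trained jointly, which is omitted here. The `tanh` stand-in for the deep autoencoder and all sizes are illustrative, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))          # snapshots x state dimension
r = 4                                      # latent rank

# Linear branch: rank-r POD reconstruction via truncated SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pod = (U[:, :r] * s[:r]) @ Vt[:r]

# Nonlinear branch: placeholder for a deep autoencoder's reconstruction
X_ae = np.tanh(X_pod)

alpha = 0.7                                # learnable weight (held fixed here)
X_hybrid = alpha * X_pod + (1.0 - alpha) * X_ae

err = np.linalg.norm(X - X_hybrid) / np.linalg.norm(X)
print(X_hybrid.shape == X.shape, np.isfinite(err))
```

With `alpha = 1` the model collapses to standard POD, which is the degenerate case the paper's learnable weighting is designed to avoid.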

[325] Understanding Adam Requires Better Rotation Dependent Assumptions

Tianyue H. Zhang, Lucas Maes, Alan Milligan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret

Main category: cs.LG

TL;DR: Adam’s performance advantage over SGD lacks theoretical explanation. The paper shows Adam is sensitive to parameter space rotations, with performance degrading under random rotations but preserved under structured rotations, suggesting orthogonality of updates may be key to understanding its basis-dependent behavior.

DetailsMotivation: Despite Adam's widespread adoption, there's no comprehensive theoretical explanation for its advantage over SGD. The paper aims to understand Adam's sensitivity to parameter space rotations and identify the rotation-dependent properties that contribute to its empirical success.

Method: The authors investigate Adam’s sensitivity to rotations by: (1) testing performance under random rotations of parameter space, (2) identifying structured rotations that preserve/enhance performance, (3) examining existing rotation-dependent assumptions in literature, and (4) verifying orthogonality of updates as an indicator of basis sensitivity.

Result: Adam’s performance in training transformers degrades under random rotations, showing crucial sensitivity to basis choice. Structured rotations can preserve or enhance performance. Existing rotation-dependent assumptions fail to explain Adam’s behavior across rotation types, while orthogonality of updates appears to be a promising indicator of basis sensitivity.

Conclusion: Conventional rotation-invariant assumptions are insufficient to explain Adam’s advantages. Orthogonality of updates may be the key quantity for developing rotation-dependent theoretical frameworks that better explain Adam’s empirical success, suggesting future theoretical work should focus on basis-dependent properties.

Abstract: Despite its widespread adoption, Adam’s advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam’s sensitivity to rotations of the parameter space. We observe that Adam’s performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam’s advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam’s behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam’s basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
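
The rotation experiment is easy to reproduce in miniature. The sketch below computes one Adam-style step in the original basis and in a randomly rotated basis (mapped back): Adam's per-coordinate second-moment normalization is basis-dependent, so the two updates differ, while the same check for plain SGD yields identical updates. The single-step, zero-moment setup is a deliberate simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 8, 1e-3
g = rng.standard_normal(d)                 # a gradient sample

def adam_step(grad, eps=1e-8):
    # First Adam step from zero moment buffers; after bias correction the
    # moments equal grad and grad**2, so the update is roughly lr * sign(grad).
    return lr * grad / (np.sqrt(grad ** 2) + eps)

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation

adam_gap = np.linalg.norm(adam_step(g) - Q.T @ adam_step(Q @ g))
sgd_gap = np.linalg.norm(lr * g - Q.T @ (lr * (Q @ g)))

# Adam is not rotation-equivariant; SGD is (up to float error).
print(adam_gap > 1e-4, sgd_gap < 1e-12)
```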

[326] GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

Advik Raj Basani, Xiao Zhang

Main category: cs.LG

TL;DR: GASP is an automated framework that uses latent Bayesian optimization to generate human-readable jailbreak prompts for LLMs, improving attack success while maintaining prompt coherence.

DetailsMotivation: Traditional jailbreak attack methods have limitations: manual heuristics lack generalizability, while optimization-based attacks produce unnatural prompts that are easily detected or require high computational costs.

Method: GASP leverages latent Bayesian optimization to craft adversarial suffixes by exploring continuous latent embedding spaces, using a targeted iterative refinement procedure to balance attack efficacy and prompt coherence.

Result: GASP significantly improves jailbreak success over baselines, reduces training times, and accelerates inference speed while producing natural adversarial prompts.

Conclusion: GASP provides an efficient and scalable solution for red-teaming LLMs, addressing the limitations of traditional jailbreak attack methods.

Abstract: LLMs have shown impressive capabilities across various natural language processing tasks, yet remain vulnerable to input prompts, known as jailbreak attacks, carefully designed to bypass safety guardrails and elicit harmful responses. Traditional methods rely on manual heuristics but suffer from limited generalizability. Despite being automatic, optimization-based attacks often produce unnatural prompts that can be easily detected by safety filters or require high computational costs due to discrete token optimization. In this paper, we introduce Generative Adversarial Suffix Prompter (GASP), a novel automated framework that can efficiently generate human-readable jailbreak prompts in a fully black-box setting. In particular, GASP leverages latent Bayesian optimization to craft adversarial suffixes by efficiently exploring continuous latent embedding spaces, gradually optimizing the suffix prompter to improve attack efficacy while balancing prompt coherence via a targeted iterative refinement procedure. Through comprehensive experiments, we show that GASP can produce natural adversarial prompts, significantly improving jailbreak success over baselines, reducing training times, and accelerating inference speed, thus making it an efficient and scalable solution for red-teaming LLMs.
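
The latent Bayesian optimization loop at GASP's core can be sketched generically. Below, a GP surrogate over a 1-D "latent suffix embedding" is paired with an upper-confidence-bound acquisition; the black-box objective is a toy stand-in for an attack-success score, and the kernel, length scale, and UCB rule are illustrative choices, not GASP's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
objective = lambda z: np.exp(-(z - 0.7) ** 2 / 0.02)   # toy black-box score

def gp_posterior(Z, y, Zs, ls=0.1, noise=1e-4):
    # GP regression with an RBF kernel; posterior mean and std on grid Zs
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ls ** 2))
    K = k(Z, Z) + noise * np.eye(len(Z))
    sol = np.linalg.solve(K, k(Z, Zs))
    mu = sol.T @ y
    var = np.clip(np.diag(k(Zs, Zs) - k(Z, Zs).T @ sol), 0.0, None)
    return mu, np.sqrt(var)

grid = np.linspace(0.0, 1.0, 101)
Z = list(rng.uniform(0.0, 1.0, 3))           # initial random latent queries
y = [float(objective(z)) for z in Z]
for _ in range(10):                          # BO loop
    mu, sd = gp_posterior(np.array(Z), np.array(y), grid)
    z_next = float(grid[np.argmax(mu + 2.0 * sd)])   # UCB acquisition
    Z.append(z_next)
    y.append(float(objective(z_next)))

print(max(y) > 0.3)   # the loop homes in on the high-scoring latent region
```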

[327] Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu

Main category: cs.LG

TL;DR: Quamba2 is a quantization framework for State Space Models (SSMs) that supports multiple bit-width configurations (W8A8, W4A8, W4A16) for both Mamba1 and Mamba2 backbones, achieving significant speed-ups and memory reduction with minimal accuracy loss.

DetailsMotivation: SSMs face challenges in scaling due to storage requirements and computational power limitations. Quantization can reduce model size and leverage hardware acceleration, but existing methods are optimized for specific configurations rather than supporting the diverse bit-width needs of different deployment scenarios.

Method: Proposes an offline quantization approach that preserves channel order and activation persistence in SSMs. It quantizes linear recurrence inputs in 8-bit through sorting and clustering, uses per-state-group quantization for input-dependent parameters, and rearranges weights offline to maintain compute-invariance in SSM output.

Result: Quamba2-8B outperforms state-of-the-art SSM quantization methods, achieving 1.3× speed-up in pre-filling, 3× speed-up in generation stages, and 4× memory reduction with only 1.6% average accuracy drop. Evaluation on MMLU demonstrates generalizability and robustness.

Conclusion: Quamba2 provides an effective solution for deploying SSMs across various platforms by supporting multiple bit-width configurations while maintaining performance and efficiency, addressing the growing demand for SSM deployment in diverse scenarios.

Abstract: State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms two state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
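
The sort-and-cluster idea generalizes beyond SSMs and is easy to demonstrate: channels with similar dynamic ranges share one 8-bit scale, which beats a single tensor-wide scale. The group count and layout below are illustrative, not Quamba2's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
# channels with widely varying dynamic ranges
x = rng.standard_normal((16, 32)) * rng.uniform(0.1, 4.0, size=(16, 1))

# Baseline: one symmetric 8-bit scale for the whole tensor
scale_all = np.abs(x).max() / 127.0
x_q_single = np.clip(np.round(x / scale_all), -127, 127) * scale_all

# Sort channels by dynamic range, then split into clusters sharing one scale
n_groups = 4
order = np.argsort(np.abs(x).max(axis=1))
x_q = np.empty_like(x)
for idx in np.array_split(order, n_groups):
    scale = np.abs(x[idx]).max() / 127.0
    x_q[idx] = np.clip(np.round(x[idx] / scale), -127, 127) * scale

err = lambda xq: np.linalg.norm(x - xq) / np.linalg.norm(x)
print(err(x_q) < err(x_q_single))   # grouping reduces quantization error
```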

[328] How Memory in Optimization Algorithms Implicitly Modifies the Loss

Matias D. Cattaneo, Boris Shigida

Main category: cs.LG

TL;DR: The paper introduces a technique to approximate optimization algorithms with memory by converting them to memoryless versions with correction terms, revealing how memory implicitly regularizes optimization dynamics.

DetailsMotivation: To understand how memory in optimization algorithms (like momentum) affects training dynamics and generalization, and to provide theoretical explanations for performance differences between algorithms like AdamW and Lion.

Method: Develop a general technique that replaces past iterates with current ones in update rules, adding a correction term that represents memory effects as a perturbation of the loss function.

Result: The analysis shows that Lion does not exhibit the same implicit anti-regularization induced by memory that AdamW does, explaining Lion’s better generalization performance.

Conclusion: Memory in optimization algorithms can be understood as implicit regularization, and the proposed technique provides a theoretical framework to analyze and compare different optimization methods’ generalization properties.

Abstract: In modern optimization methods used in deep learning, each update depends on the history of previous iterations, often referred to as memory, and this dependence decays fast as the iterates go further into the past. For example, gradient descent with momentum has exponentially decaying memory through exponentially averaged past gradients. We introduce a general technique for identifying a memoryless algorithm that approximates an optimization algorithm with memory. It is obtained by replacing all past iterates in the update by the current one, and then adding a correction term arising from memory (also a function of the current iterate). This correction term can be interpreted as a perturbation of the loss, and the nature of this perturbation can inform how memory implicitly (anti-)regularizes the optimization dynamics. As an application of our theory, we find that Lion does not have the kind of implicit anti-regularization induced by memory that AdamW does, providing a theory-based explanation for Lion’s better generalization performance recently documented.
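
The replace-past-iterates-with-current idea can be checked numerically on a 1-D quadratic, $f(x) = x^2/2$. Heavy-ball momentum (an algorithm with exponentially decaying memory) is compared below with its memoryless stand-in: gradient descent with effective step $\mathrm{lr}/(1-\beta)$, obtained by substituting the current gradient for every past one. The residual gap between the two trajectories is what the paper's correction term accounts for; the hyperparameters are illustrative.

```python
lr, beta, T = 0.01, 0.9, 200
grad = lambda x: x                          # gradient of f(x) = x^2 / 2

x_mom, m = 1.0, 0.0
x_memless = 1.0
for _ in range(T):
    m = beta * m + grad(x_mom)              # exponentially averaged memory
    x_mom -= lr * m
    # memoryless approximation: all past gradients replaced by the current one
    x_memless -= (lr / (1 - beta)) * grad(x_memless)

# both converge to the minimum, and the approximation tracks momentum closely
print(abs(x_mom) < 0.01, abs(x_mom - x_memless) < 0.05)
```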

[329] Multimodal Cancer Modeling in the Age of Foundation Model Embeddings

Steven Song, Morgan Borjigin-Wang, Irene Madejski, Robert L. Grossman

Main category: cs.LG

TL;DR: The paper proposes using foundation model embeddings for multimodal cancer data analysis, showing that combining text (pathology reports) with other data improves performance over unimodal approaches.

DetailsMotivation: TCGA contains underutilized free-text pathology reports, and foundation models can create task-agnostic embeddings for better cancer modeling.

Method: Train classical ML models using multimodal, zero-shot foundation model embeddings of cancer data, including pathology report text and evaluating text summarization effects.

Result: Multimodal fusion outperforms unimodal models, and including pathology report text provides benefits despite potential summarization artifacts.

Conclusion: An embedding-centric approach to multimodal cancer modeling is effective and enables easy integration of diverse data types.

Abstract: The Cancer Genome Atlas (TCGA) has enabled novel discoveries and served as a large-scale reference dataset in cancer through its harmonized genomics, clinical, and imaging data. Numerous prior studies have developed bespoke deep learning models over TCGA for tasks such as cancer survival prediction. A modern paradigm in biomedical deep learning is the development of foundation models (FMs) to derive feature embeddings agnostic to a specific modeling task. Biomedical text especially has seen growing development of FMs. While TCGA contains free-text data as pathology reports, these have been historically underutilized. Here, we investigate the ability to train classical machine learning models over multimodal, zero-shot FM embeddings of cancer data. We demonstrate the ease and additive effect of multimodal fusion, outperforming unimodal models. Further, we show the benefit of including pathology report text and rigorously evaluate the effect of model-based text summarization and hallucination. Overall, we propose an embedding-centric approach to multimodal cancer modeling.
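
The embedding-centric recipe, fuse per-modality foundation-model embeddings and train a classical model on top, can be sketched with synthetic stand-ins. Below, simulated "report" and "imaging" embeddings carrying a shared label signal are concatenated (late fusion) and fed to a tiny logistic regression trained by gradient descent; the data, dimensions, and classifier are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                                  # binary outcome
text_emb = rng.standard_normal((n, 8)) + y[:, None]        # "report" features
img_emb = rng.standard_normal((n, 6)) + 0.5 * y[:, None]   # "imaging" features

X = np.hstack([text_emb, img_emb])        # late fusion by concatenation
X = (X - X.mean(0)) / X.std(0)            # standardize fused features

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):                      # gradient descent on logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / n)
    b -= 0.5 * (p - y).mean()

acc = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(acc > 0.9)   # the fused features separate the classes easily
```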

[330] Learning Dynamics of RNNs in Closed-Loop Environments

Yoav Ger, Omri Barak

Main category: cs.LG

TL;DR: The paper develops a mathematical theory for learning dynamics of linear RNNs in closed-loop environments, showing they follow different trajectories than open-loop training due to competing objectives of policy improvement and stability.

DetailsMotivation: Real-world learning occurs in closed-loop environments where agents interact with their surroundings, but most neuroscience-inspired RNN models are trained in open-loop supervised settings, creating a gap between artificial and biological learning.

Method: Developed mathematical theory to characterize learning dynamics of linear RNNs in closed-loop contexts, analyzed distinct learning stages aligned with training loss evolution, and applied framework to motor control task.

Result: Found that closed-loop RNNs follow markedly different learning trajectories than identical open-loop RNNs, with dynamics governed by interplay between short-term policy improvement and long-term stability of agent-environment interaction.

Conclusion: Results underscore the importance of modeling closed-loop dynamics for biologically plausible learning models, as closed-loop training captures essential aspects of real-world learning that open-loop approaches miss.

Abstract: Recurrent neural networks (RNNs) trained on neuroscience-inspired tasks offer powerful models of brain computation. However, typical training paradigms rely on open-loop, supervised settings, whereas real-world learning unfolds in closed-loop environments. Here, we develop a mathematical theory describing the learning dynamics of linear RNNs trained in closed-loop contexts. We first demonstrate that two otherwise identical RNNs, trained in either closed- or open-loop modes, follow markedly different learning trajectories. To probe this divergence, we analytically characterize the closed-loop case, revealing distinct stages aligned with the evolution of the training loss. Specifically, we show that the learning dynamics of closed-loop RNNs, in contrast to open-loop ones, are governed by an interplay between two competing objectives: short-term policy improvement and long-term stability of the agent-environment interaction. Finally, we apply our framework to a realistic motor control task, highlighting its broader applicability. Taken together, our results underscore the importance of modeling closed-loop dynamics in a biologically plausible setting.
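
The open- vs. closed-loop divergence can be illustrated with an even simpler agent than a linear RNN: a scalar gain u = k·x controlling the plant x' = a·x + b·u (an assumption-laden stand-in, not the paper's setup). Open-loop learning regresses the gain toward a fixed teacher; closed-loop learning descends the actual rollout cost, so the gradient depends on states the current policy itself generates. Both stabilize the system, but along different trajectories.

```python
a, b = 0.9, 1.0                      # plant x' = a*x + b*u
k_teacher = -0.8                     # teacher gain for open-loop supervision

def rollout_cost_grad(k, x0=1.0, T=20, h=1e-4):
    # finite-difference gradient of the closed-loop rollout cost
    def cost(kk):
        x, c = x0, 0.0
        for _ in range(T):
            u = kk * x
            c += x * x + 0.1 * u * u
            x = a * x + b * u
        return c
    return (cost(k + h) - cost(k - h)) / (2 * h)

k_open = k_closed = 0.0
open_traj, closed_traj = [], []
for _ in range(200):
    k_open -= 0.2 * (k_open - k_teacher)             # open-loop: regress to teacher
    k_closed -= 0.005 * rollout_cost_grad(k_closed)  # closed-loop: rollout gradient
    open_traj.append(k_open)
    closed_traj.append(k_closed)

gap = max(abs(o - c) for o, c in zip(open_traj, closed_traj))
# both end up stabilizing (|a + b*k| < 1), but the paths differ markedly
print(abs(a + b * k_open) < 1, abs(a + b * k_closed) < 1, gap > 0.05)
```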

[331] Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences

Andrew Kyle Lampinen, Martin Engelcke, Yuxuan Li, Arslan Chaudhry, James L. McClelland

Main category: cs.LG

TL;DR: Machine learning systems fail to generalize due to lack of latent learning - learning information not immediately relevant but potentially useful later. Episodic memory with oracle retrieval can improve generalization across challenges like the reversal curse and navigation tasks.

DetailsMotivation: To understand why machine learning systems struggle with generalization compared to natural intelligence, drawing inspiration from cognitive science concepts like latent learning that humans exhibit but current ML systems lack.

Method: Analyzed failures like reversal curse in language modeling and agent navigation, then implemented systems with oracle retrieval mechanisms to test if episodic memory can improve flexible use of learning experiences.

Result: Systems with oracle retrieval demonstrated better generalization across multiple challenges. Within-example in-context learning was identified as crucial for effectively using information across retrieved examples.

Conclusion: Retrieval methods can complement parametric learning to address data inefficiency in ML systems. The findings connect to cognitive science and neuroscience, suggesting episodic memory as a key mechanism for improving generalization.

Abstract: When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of parametric machine learning systems is their failure to exhibit latent learning – learning information that is not relevant to the task at hand, but that might be useful in a future task. We show how this perspective links failures ranging from the reversal curse in language modeling to new findings on agent-based navigation. We then highlight how cognitive science points to episodic memory as a potential part of the solution to these issues. Correspondingly, we show that a system with an oracle retrieval mechanism can use learning experiences more flexibly to generalize better across many of these challenges. We also identify some of the essential components for effectively using retrieval, including the importance of within-example in-context learning for acquiring the ability to use information across retrieved examples. In summary, our results illustrate one possible contributor to the relative data inefficiency of current machine learning systems compared to natural intelligence, and help to understand how retrieval methods can complement parametric learning to improve generalization. We close by discussing some of the links between these findings and prior results in cognitive science and neuroscience, and the broader implications.

[332] But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen

Main category: cs.LG

TL;DR: JUSSA is a framework that uses steering vectors to generate contrastive examples, improving LLM judges’ ability to detect subtle dishonesty like sycophancy and manipulation.

DetailsMotivation: Current methods struggle to detect subtle dishonest behaviors in LLMs, as they appear through small biases rather than clear false statements, making evaluation challenging.

Method: JUSSA applies steering vectors during inference to create more honest alternatives, providing judges with contrastive examples that highlight subtle dishonest patterns.

Result: JUSSA improves detection accuracy over single-response evaluation across various cases, helps weaker judges on easier tasks and stronger judges on harder tasks, and reveals where steering interventions are most effective.

Conclusion: Steering vectors can enhance safety evaluation rather than just modify behavior, opening new directions for scalable model auditing as systems become more sophisticated.

Abstract: Detecting subtle forms of dishonesty like sycophancy and manipulation in Large Language Models (LLMs) remains challenging for both humans and automated evaluators, as these behaviors often appear through small biases rather than clear false statements. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a novel framework that employs steering vectors not to improve model behavior directly, but to enhance LLM judges’ evaluation capabilities. JUSSA applies steering vectors during inference to generate more honest alternatives, providing judges with contrastive examples that make subtle dishonest patterns easier to detect. While existing evaluation methods rely on black-box evaluation, JUSSA leverages model internals to create targeted comparisons from single examples. We evaluate our method on sycophancy detection and introduce a new manipulation dataset covering multiple types of manipulation. Our results demonstrate that JUSSA effectively improves detection accuracy over single-response evaluation in various cases. Analysis across judge models reveals that JUSSA helps weaker judges on easier dishonesty detection tasks, and stronger judges on harder tasks. Layer-wise experiments show how dishonest prompts cause representations to diverge from honest ones in middle layers, revealing where steering interventions are most effective for generating contrastive examples. By demonstrating that steering vectors can enhance safety evaluation rather than just modify behavior, our work opens new directions for scalable model auditing as systems become increasingly sophisticated.
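
Activation steering itself is a simple operation. The sketch below builds a difference-of-means steering vector from toy "honest" and "dishonest" hidden states and adds it to a new activation at inference time; the toy states and single-vector "layer" are stand-ins for an LLM's internals, not JUSSA's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
honesty_dir = rng.standard_normal(d)
honesty_dir /= np.linalg.norm(honesty_dir)

# Toy hidden states: dishonest ones lack the honesty component
h_honest = rng.standard_normal((50, d)) + 2.0 * honesty_dir
h_dishonest = rng.standard_normal((50, d))

steer = h_honest.mean(0) - h_dishonest.mean(0)   # difference-of-means vector

h = rng.standard_normal(d)        # a new activation to steer at inference
alpha = 1.0                       # steering strength
h_steered = h + alpha * steer     # applied during the forward pass

# projection onto the honesty direction increases after steering
print(h_steered @ honesty_dir > h @ honesty_dir)
```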

[333] Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data

Lingkai Kong, Haichuan Wang, Tonghan Wang, Guojun Xiong, Milind Tambe

Main category: cs.LG

TL;DR: CompFlow improves RL sample efficiency by using optimal transport theory to handle dynamics discrepancies between source and target environments, enabling principled dynamics gap estimation and optimistic exploration.

DetailsMotivation: Existing methods for using offline data in RL struggle with dynamics discrepancies between source and target environments, particularly when using ill-defined metrics like KL divergence that fail with disjoint support.

Method: Models target dynamics as conditional flow built on source-domain flow output, uses Wasserstein distance for principled dynamics gap estimation, and implements optimistic active data collection in high-gap regions.

Result: Empirically outperforms strong baselines across multiple RL benchmarks with shifted dynamics.

Conclusion: CompFlow provides a theoretically grounded approach for dynamics-aware RL that effectively leverages source data while handling dynamics discrepancies through optimal transport principles.

Abstract: Incorporating pre-collected offline data from a source environment can significantly improve the sample efficiency of reinforcement learning (RL), but this benefit is often challenged by discrepancies between the transition dynamics of the source and target environments. Existing methods typically address this issue by penalizing or filtering out source transitions in high dynamics-gap regions. However, their estimation of the dynamics gap often relies on KL divergence or mutual information, which can be ill-defined when the source and target dynamics have disjoint support. To overcome these limitations, we propose CompFlow, a method grounded in the theoretical connection between flow matching and optimal transport. Specifically, we model the target dynamics as a conditional flow built upon the output distribution of the source-domain flow, rather than learning it directly from a Gaussian prior. This composite structure offers two key advantages: (1) improved generalization for learning target dynamics, and (2) a principled estimation of the dynamics gap via the Wasserstein distance between source and target transitions. Leveraging our principled estimation of the dynamics gap, we further introduce an optimistic active data collection strategy that prioritizes exploration in regions of high dynamics gap, and theoretically prove that it reduces the performance disparity with the optimal policy. Empirically, CompFlow outperforms strong baselines across several RL benchmarks with shifted dynamics.
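
The Wasserstein-based gap estimate has a closed form in 1-D that makes the disjoint-support point concrete: for equal-sized empirical samples, W1 is the mean absolute difference of the sorted samples. The environments below are toy Gaussians, not CompFlow's flow-based estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def w1(a, b):
    # 1-D empirical Wasserstein-1 for equal sample counts
    return np.abs(np.sort(a) - np.sort(b)).mean()

source = rng.normal(0.0, 1.0, 1000)        # source next-state samples
target_near = rng.normal(0.2, 1.0, 1000)   # small dynamics shift
target_far = rng.normal(5.0, 1.0, 1000)    # effectively disjoint support

# Unlike KL, W1 stays finite and informative even with disjoint supports,
# and it grows with the size of the dynamics shift.
print(w1(source, target_near) < w1(source, target_far))
```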

[334] HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying

Main category: cs.LG

TL;DR: HELM introduces hyperbolic geometry to large language models, addressing limitations of Euclidean operations and achieving up to 4% performance gains on benchmarks like MMLU and ARC.

DetailsMotivation: Current LLMs rely on Euclidean operations that don't capture the inherent semantic hierarchies and geometric structure of natural language, leading to training instabilities and degraded generative capabilities.

Method: Proposes HELM family of hyperbolic LLMs including HELM-MICE with Mixture-of-Curvature Experts and hyperbolic Multi-Head Latent Attention, plus HELM-D dense model, with hyperbolic equivalents of rotary positional encodings and RMS normalization.

Result: HELM models achieve consistent gains up to 4% over Euclidean architectures (LLaMA, DeepSeek) on benchmarks spanning STEM problem-solving, general knowledge, and commonsense reasoning.

Conclusion: Hyperbolic geometry provides enhanced reasoning capabilities and better alignment with the underlying geometry of text in large-scale language model pretraining.

Abstract: Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in Hyperbolic space, known for its expansive, scale-free, and low-distortion properties. We thus introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures – up to 4% – over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.
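
One ingredient of the hyperbolic setup is textbook geometry and worth seeing in isolation: the Poincaré ball distance (curvature -1) blows up near the unit boundary, which is what gives hyperbolic space its tree-like, hierarchy-friendly structure. This is a generic illustration, not HELM's implementation.

```python
import numpy as np

def poincare_dist(u, v):
    # distance in the Poincare ball model (curvature -1)
    nu2, nv2 = np.dot(u, u), np.dot(v, v)
    diff2 = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * diff2 / ((1 - nu2) * (1 - nv2)))

origin = np.zeros(2)
near = np.array([0.1, 0.0])    # Euclidean distance 0.1 from the origin
edge = np.array([0.99, 0.0])   # Euclidean distance 0.99 from the origin

# Euclidean distance grows by ~10x here, but hyperbolic distance grows far
# faster as the point approaches the boundary of the ball.
print(poincare_dist(origin, edge) > 10 * poincare_dist(origin, near))
```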

[335] Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence

José Manuel de Frutos, Manuel A. Vázquez, Pablo M. Olmos, Joaquín Míguez

Main category: cs.LG

TL;DR: Dual-ISL is a novel likelihood-free objective for training implicit generative models that interchanges target and model distributions in the ISL framework, creating a convex optimization problem with improved convergence, stability, and mode collapse prevention.

DetailsMotivation: To address limitations of classical divergences (KL, Wasserstein) and improve upon existing rank-based metrics like ISL by creating a convex optimization framework with better theoretical properties and practical performance.

Method: Proposes dual-ISL by interchanging roles of target and model distributions in ISL framework, interprets the resulting discrepancy as L²-projection of density ratio onto Bernstein polynomial basis, and extends to multivariate setting via random one-dimensional projections.

Result: Dual-ISL achieves faster convergence, smoother and more stable training, better mode collapse prevention than classical ISL and other methods, while providing explicit density approximation. The method is proven to be continuous under weak convergence and convex in its first argument.

Conclusion: Dual-ISL provides a theoretically sound and practically effective framework for training implicit generative models with superior convergence properties, stability, and mode collapse prevention compared to existing methods.

Abstract: Rank-based statistical metrics, such as the invariant statistical loss (ISL), have recently emerged as robust and practically effective tools for training implicit generative models. In this work, we introduce dual-ISL, a novel likelihood-free objective for training implicit generative models that interchanges the roles of the target and model distributions in the ISL framework, yielding a convex optimization problem in the space of model densities. We prove that the resulting rank-based discrepancy $d_K$ is i) continuous under weak convergence and with respect to the $L^1$ norm, and ii) convex in its first argument – properties not shared by classical divergences such as KL or Wasserstein distances. Building on this, we develop a theoretical framework that interprets $d_K$ as an $L^2$-projection of the density ratio $q = p/\tilde p$ onto a Bernstein polynomial basis, from which we derive exact bounds on the truncation error, precise convergence rates, and a closed-form expression for the truncated density approximation. We further extend our analysis to the multivariate setting via random one-dimensional projections, defining a sliced dual-ISL divergence that retains both convexity and continuity. We empirically show that these theoretical advantages translate into practical ones. Specifically, across several benchmarks dual-ISL converges more rapidly, delivers markedly smoother and more stable training, and more effectively prevents mode collapse than classical ISL and other leading implicit generative methods – while also providing an explicit density approximation.
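For context on the Bernstein basis the abstract projects onto, a minimal sketch of the degree-$n$ Bernstein approximation of a function on $[0,1]$ (standard material, not the paper's estimator):

```python
import math

def bernstein_approx(f, n, x):
    """Degree-n Bernstein operator: B_n(f)(x) = sum_k f(k/n) * C(n,k) x^k (1-x)^(n-k).

    As n grows, B_n(f) converges uniformly to f on [0, 1]; dual-ISL's
    truncated density approximation lives in this polynomial span.
    """
    return sum(f(k / n) * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

# Bernstein operators reproduce affine functions exactly:
val = bernstein_approx(lambda t: 2 * t + 1, n=10, x=0.3)
```

A useful sanity check on any implementation is exactness on affine functions, since $B_n(at+b) = ax+b$ for every $n$.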

[336] Deep Graph Learning for Industrial Carbon Emission Analysis and Policy Impact

Xuanming Zhang

Main category: cs.LG

TL;DR: A novel graph-based deep learning framework (DGL) that combines Graph Neural Networks with temporal transformers to forecast industrial CO2 emissions, addressing multicollinearity and capturing complex industrial-temporal dependencies while maintaining interpretability.

DetailsMotivation: Industrial carbon emissions are major climate change drivers, but modeling is challenging due to multicollinearity among factors and complex interdependencies across sectors and time. Traditional methods struggle with these complexities.

Method: Graph Neural Network with attention mechanisms to model industry/region relationships, combined with temporal transformer for long-range patterns. Uses structural encoding to resolve multicollinearity and integrates causal inference for interpretability.

Result: Achieves superior predictive performance with over 15% error reduction compared to baseline deep models. Identifies high-emission hotspots and enables sector-specific decarbonization strategies aligned with sustainable development goals.

Conclusion: The proposed Graph-Temporal architecture offers a powerful tool for policymakers and industry stakeholders to achieve carbon reduction targets, advancing climate action through state-of-the-art AI graph learning with improved transparency and fairness.

Abstract: Industrial carbon emissions are a major driver of climate change, yet modeling these emissions is challenging due to multicollinearity among factors and complex interdependencies across sectors and time. We propose a novel graph-based deep learning framework DGL to analyze and forecast industrial CO_2 emissions, addressing high feature correlation and capturing industrial-temporal interdependencies. Unlike traditional regression or clustering methods, our approach leverages a Graph Neural Network (GNN) with attention mechanisms to model relationships between industries (or regions) and a temporal transformer to learn long-range patterns. We evaluate our framework on a public global industry emissions dataset derived from EDGAR v8.0, spanning multiple countries and sectors. The proposed model achieves superior predictive performance - reducing error by over 15% compared to baseline deep models - while maintaining interpretability via attention weights and causal analysis. To our knowledge, ours is the first Graph-Temporal architecture to resolve multicollinearity by structurally encoding feature relationships, and it integrates causal inference to identify true drivers of emissions, improving transparency and fairness. We also demonstrate policy relevance, showing how model insights can guide sector-specific decarbonization strategies aligned with sustainable development goals. Building on these results, we identify high-emission “hotspots” and suggest equitable intervention plans, illustrating the potential of state-of-the-art AI graph learning to advance climate action, offering a powerful tool for policymakers and industry stakeholders to achieve carbon reduction targets.

[337] GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units

Maxence Bouvier, Ryan Amaudruz, Felix Arnold, Renzo Andri, Lukas Cavigelli

Main category: cs.LG

TL;DR: GENIAL is a machine learning framework that automatically generates and optimizes arithmetic units using a Transformer-based surrogate model to predict hardware metrics and search for optimal operand encodings, achieving up to 18% power savings.

DetailsMotivation: Conventional design flows for arithmetic units rely on manual or heuristic-based optimization, which cannot thoroughly explore the vast design space needed for AI workloads, creating a need for automated optimization.

Method: Uses a Transformer-based surrogate model trained in two stages (self-supervised pretraining + supervised finetuning) to predict hardware metrics, then inverts the model to search for optimal operand encodings that minimize power consumption.

Result: Achieves up to 18% switching activity savings in multipliers compared to conventional two’s complement, demonstrates better sample efficiency and faster convergence than other methods, and shows versatility with significant improvements on Finite State Machines.

Conclusion: GENIAL represents a significant advancement toward automated Quality-of-Results-optimized combinational circuit generation for digital systems, applicable to a wide spectrum of logic functions.

Abstract: As AI workloads proliferate, optimizing arithmetic units is becoming increasingly important for reducing the footprint of digital systems. Conventional design flows, which often rely on manual or heuristic-based optimization, are limited in their ability to thoroughly explore the vast design space. In this paper, we introduce GENIAL, a machine learning-based framework for the automatic generation and optimization of arithmetic units, with a focus on multipliers. At the core of GENIAL is a Transformer-based surrogate model trained in two stages, involving self-supervised pretraining followed by supervised finetuning, to robustly forecast key hardware metrics such as power and area from abstracted design representations. By inverting the surrogate model, GENIAL efficiently searches for new operand encodings that directly minimize power consumption in arithmetic units for specific input data distributions. Extensive experiments on large datasets demonstrate that GENIAL is consistently more sample efficient than other methods, and converges faster towards optimized designs. This enables deployment of a high-effort logic synthesis optimization flow in the loop, improving the accuracy of the surrogate model. Notably, GENIAL automatically discovers encodings that achieve up to 18% switching activity savings within multipliers on representative AI workloads compared with the conventional two’s complement. We also demonstrate the versatility of our approach by achieving significant improvements on Finite State Machines, highlighting GENIAL’s applicability for a wide spectrum of logic functions. Together, these advances mark a significant step toward automated Quality-of-Results-optimized combinational circuit generation for digital systems.
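A toy sketch of the surrogate-inversion idea behind GENIAL: rather than synthesizing every candidate operand encoding, score candidates with a cheap surrogate of switching activity and keep the minimizer. The surrogate here (bit flips between consecutive code words) and all names are illustrative assumptions, not the paper's Transformer model.

```python
def surrogate_power(encoding):
    """Stand-in surrogate: count bit flips between consecutive code words,
    a toy proxy for switching activity (not GENIAL's learned model)."""
    flips = 0
    for a, b in zip(encoding, encoding[1:]):
        flips += bin(a ^ b).count("1")
    return flips

def invert_surrogate(candidates):
    """'Invert' the surrogate by searching the encoding space for the
    candidate with the lowest predicted cost."""
    return min(candidates, key=surrogate_power)

# Counting 0..3 in plain binary toggles more bits than a Gray code:
binary = [0b00, 0b01, 0b10, 0b11]   # 1 + 2 + 1 = 4 flips
gray = [0b00, 0b01, 0b11, 0b10]     # 1 + 1 + 1 = 3 flips
best = invert_surrogate([binary, gray])
```

The Gray-code example mirrors the abstract's finding in miniature: an alternative operand encoding can cut switching activity relative to the conventional binary (two's complement) ordering.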

[338] A Multi-target Bayesian Transformer Framework for Predicting Cardiovascular Disease Biomarkers during Pandemics

Trusting Inekwe, Winnie Mkandawire, Emmanuel Agu, Andres Colubri

Main category: cs.LG

TL;DR: Proposed MBT-CB, a Multi-target Bayesian Transformer model for jointly predicting CVD biomarkers (LDL-C, HbA1c, BMI, SysBP) from EHR data during COVID-19, capturing interdependencies, temporal patterns, and uncertainty.

DetailsMotivation: COVID-19 disrupted healthcare for CVD patients, affecting key biomarkers. Need accurate modeling to predict disease progression and guide preventive care, addressing gaps in multi-target prediction from EHRs with ML.

Method: Multi-target Bayesian Transformer (MBT) with pre-trained BERT framework, using Bayesian Variational Inference for uncertainty, embeddings for temporal relationships, and DeepMTR for biomarker inter-relationships.

Result: Outperformed baselines with MAE 0.00887, RMSE 0.0135, MSE 0.00027 on 3,390 CVD patient records. Effectively captured uncertainty, biomarker relationships, and temporal dynamics.

Conclusion: MBT-CB shows superior performance for CVD biomarker prediction during pandemics, supporting clinical decision-making through improved uncertainty estimation and relationship modeling.

Abstract: The COVID-19 pandemic disrupted healthcare systems worldwide, disproportionately impacting individuals with chronic conditions such as cardiovascular disease (CVD). These disruptions – through delayed care and behavioral changes – affected key CVD biomarkers, including LDL cholesterol (LDL-C), HbA1c, BMI, and systolic blood pressure (SysBP). Accurate modeling of these changes is crucial for predicting disease progression and guiding preventive care. However, prior work has not addressed multi-target prediction of CVD biomarkers from Electronic Health Records (EHRs) using machine learning (ML), while jointly capturing biomarker interdependencies, temporal patterns, and predictive uncertainty. In this paper, we propose MBT-CB, a Multi-target Bayesian Transformer (MBT) with pre-trained BERT-based transformer framework to jointly predict LDL-C, HbA1c, BMI and SysBP CVD biomarkers from EHR data. The model leverages Bayesian Variational Inference to estimate uncertainties, embeddings to capture temporal relationships and a DeepMTR model to capture biomarker inter-relationships. We evaluate MBT-CB on retrospective EHR data from 3,390 CVD patient records (304 unique patients) in Central Massachusetts during the COVID-19 pandemic. MBT-CB outperformed a comprehensive set of baselines including other BERT-based ML models, achieving an MAE of 0.00887, RMSE of 0.0135 and MSE of 0.00027, while effectively capturing data and model uncertainty, patient biomarker inter-relationships, and temporal dynamics via its attention and embedding mechanisms. MBT-CB’s superior performance highlights its potential to improve CVD biomarker prediction and support clinical decision-making during pandemics.

[339] Test-Time Warmup for Multimodal Large Language Models

Nikita Rajaneesh, Thomas Zollo, Richard Zemel

Main category: cs.LG

TL;DR: Test-Time Warmup method adapts MLLMs per test instance using weakly supervised auxiliary tasks, improving performance on complex reasoning tasks without extensive fine-tuning.

DetailsMotivation: MLLMs underperform on complex reasoning tasks despite massive pretraining, due to limited multimodal training data (thousands to millions vs billions for components).

Method: Proposes Test-Time Warmup that adapts MLLM per test instance using data from weakly supervised auxiliary tasks, rather than relying on extensive labeled datasets.

Result: Achieved relative performance improvements: 4.03% on MMMU, 5.28% on VQA-Rad, and 1.63% on GQA using Llama-Vision-Instruct model.

Conclusion: Warming up MLLMs before inference enhances robustness across diverse reasoning tasks, demonstrating effective adaptation without extensive fine-tuning.

Abstract: Multimodal Large Language Models (MLLMs) hold great promise for advanced reasoning at the intersection of text and images, yet they have not fully realized this potential. MLLMs typically integrate an LLM, a vision encoder, and a connector that maps the vision encoder’s embeddings into the LLM’s text embedding space. Although each component is pretrained on massive datasets with billions of samples, the entire multimodal model is typically trained on only thousands (or a few million) samples, which can result in weak performance on complex reasoning tasks. To address these shortcomings, instead of relying on extensive labeled datasets for fine-tuning, we propose a Test-Time Warmup method that adapts the MLLM per test instance by leveraging data from weakly supervised auxiliary tasks. With our approach, we observe a relative performance improvement of 4.03% on MMMU, 5.28% on VQA-Rad, and 1.63% on GQA on the Llama-Vision-Instruct model. Our method demonstrates that ‘warming up’ before inference can enhance MLLMs’ robustness across diverse reasoning tasks.

[340] HyperAdapt: Simple High-Rank Adaptation

Abel Gurung, Joseph Campbell

Main category: cs.LG

TL;DR: HyperAdapt is a parameter-efficient fine-tuning method that uses row- and column-wise scaling with diagonal matrices to achieve high-rank updates using only n+m parameters for an n×m matrix, outperforming LoRA while using significantly fewer parameters.

DetailsMotivation: Foundation models require fine-tuning for specialized applications, but full fine-tuning is memory and compute-intensive. Parameter-efficient fine-tuning methods like LoRA help but still use more parameters than necessary.

Method: HyperAdapt adapts pre-trained weight matrices by applying row- and column-wise scaling through diagonal matrices, requiring only n+m trainable parameters for an n×m matrix while inducing high-rank updates.

Result: Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters show HyperAdapt matches or nearly matches full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer parameters.

Conclusion: HyperAdapt provides an effective parameter-efficient fine-tuning method that achieves competitive performance with dramatically reduced parameter requirements, establishing theoretical bounds on update ranks and demonstrating practical effectiveness across diverse tasks.

Abstract: Foundation models excel across diverse tasks, but adapting them to specialized applications often requires fine-tuning, an approach that is memory and compute-intensive. Parameter-efficient fine-tuning (PEFT) methods mitigate this by updating only a small subset of weights. In this paper, we introduce HyperAdapt, a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters compared to state-of-the-art methods like LoRA. Specifically, HyperAdapt adapts a pre-trained weight matrix by applying row- and column-wise scaling through diagonal matrices, thereby inducing a high-rank update while requiring only $n+m$ trainable parameters for an $n \times m$ matrix. Theoretically, we establish an upper bound on the rank of HyperAdapt’s updates, and empirically, we confirm that it consistently induces high-rank transformations across model layers. Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters demonstrate that HyperAdapt matches or nearly matches the performance of full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer trainable parameters.
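The core HyperAdapt update is simple enough to state directly: an $n \times m$ weight matrix is wrapped in row- and column-wise diagonal scalings, so only $n + m$ scalars are trained. A minimal pure-Python sketch of that operation (illustrative; the paper's actual training code is not shown here):

```python
def hyperadapt(W, row_scale, col_scale):
    """Apply row- and column-wise scaling to a frozen weight matrix:
    W'[i][j] = row_scale[i] * W[i][j] * col_scale[j],
    i.e. W' = diag(row_scale) @ W @ diag(col_scale).
    Only the n + m scale entries would be trainable."""
    n, m = len(W), len(W[0])
    assert len(row_scale) == n and len(col_scale) == m
    return [[row_scale[i] * W[i][j] * col_scale[j] for j in range(m)]
            for i in range(n)]

W = [[1.0, 2.0], [3.0, 4.0]]          # frozen pre-trained weights (n = m = 2)
r = [2.0, 0.5]                        # n trainable row scales
c = [1.0, 3.0]                        # m trainable column scales
W_adapted = hyperadapt(W, r, c)       # only n + m = 4 parameters adapted
```

Note the contrast with LoRA: the update $\Delta W = W' - W$ here is generally full rank (it rescales every entry of $W$), whereas a rank-$r$ LoRA adapter needs $r(n+m)$ parameters for a rank-$r$ update.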

[341] ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, Yuan Yao

Main category: cs.LG

TL;DR: LLMs struggle with genuine strategic reasoning in chess, failing to beat even amateur-level engines, but fine-tuning can significantly improve performance.

DetailsMotivation: To determine if LLMs possess genuine strategic reasoning capabilities or just sophisticated pattern recognition, using chess as a testbed for complex reasoning.

Method: Created ChessArena - a competitive framework where LLMs play chess against each other in four modes, with ranking algorithms and evaluation of fine-grained capabilities like move selection and puzzle solving.

Result: Over 13 LLMs played 800+ games but none could beat Maia-1100 (amateur chess engine), some even lost to random players. Fine-tuned Qwen3-8B showed substantial improvement, approaching state-of-the-art reasoning models.

Conclusion: Current LLMs have significant shortcomings in strategic reasoning, but targeted fine-tuning can bridge the gap between pattern recognition and genuine reasoning capabilities.

Abstract: Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills, particularly complex strategic reasoning, or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other, under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard. The testbed can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even failed to defeat a random player that selects moves arbitrarily. We also present a strong baseline to the testbed: our fine-tuned Qwen3-8B substantially improved performance, approaching much larger state-of-the-art reasoning models.

[342] Local Fragments, Global Gains: Subgraph Counting using Graph Neural Networks

Shubhajit Roy, Shrutimoy Das, Binita Maity, Anant Kumar, Anirban Dasgupta

Main category: cs.LG

TL;DR: The paper proposes localized versions of Weisfeiler-Leman algorithms (Local k-WL) for subgraph counting, which are more expressive than k-WL and more efficient than (k+1)-WL. It introduces scalable variants and a fragmentation technique for exact counting, plus a differentiable learning framework combining combinatorial algorithms with ML.

DetailsMotivation: Subgraph counting is fundamental for analyzing structural patterns in graph data, with applications in computational biology and social networks where motifs reveal functional properties. Current methods face limitations in expressivity and computational efficiency.

Method: Proposes Local k-WL algorithms with variants (Layer k-WL, Recursive k-WL) for better efficiency. Introduces fragmentation technique to decompose subgraphs into simpler patterns. Develops three-stage differentiable learning framework combining subpattern counts.

Result: Local k-WL is proven more expressive than k-WL and at most as expressive as (k+1)-WL. Enables exact count of induced subgraphs up to size 4 using only 1-WL. Methods achieve greater time/space efficiency and are more expressive than prior approaches under bounded complexity.

Conclusion: The localized WL algorithms provide improved expressivity and efficiency for subgraph counting, bridging combinatorial algorithm design with machine learning. The framework enables scalable motif analysis with theoretical guarantees on expressive power.

Abstract: Subgraph counting is a fundamental task for analyzing structural patterns in graph-structured data, with important applications in domains such as computational biology and social network analysis, where recurring motifs reveal functional and organizational properties. In this paper, we propose localized versions of the Weisfeiler-Leman (WL) algorithms to improve both expressivity and computational efficiency for this task. We introduce Local $k$-WL, which we prove to be more expressive than $k$-WL and at most as expressive as $(k+1)$-WL, and provide a characterization of patterns whose subgraph and induced subgraph counts are invariant under Local $k$-WL equivalence. To enhance scalability, we present two variants – Layer $k$-WL and Recursive $k$-WL – that achieve greater time and space efficiency compared to applying $k$-WL on the entire graph. Additionally, we propose a novel fragmentation technique that decomposes complex subgraphs into simpler subpatterns, enabling the exact count of all induced subgraphs of size at most $4$ using only $1$-WL, with extensions possible for larger patterns when $k>1$. Building on these ideas, we develop a three-stage differentiable learning framework that combines subpattern counts to compute counts of more complex motifs, bridging combinatorial algorithm design with machine learning approaches. We also compare the expressive power of Local $k$-WL with existing GNN hierarchies and demonstrate that, under bounded time complexity, our methods are more expressive than prior approaches.
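As background for Local $k$-WL, a compact sketch of the classic 1-WL color-refinement loop it localizes: each round, a node's color becomes a canonical id for the pair (its own color, the multiset of its neighbors' colors). This is the textbook algorithm, not the paper's Local/Layer/Recursive variants.

```python
def wl_refine(adj, rounds=3):
    """1-WL color refinement. adj: {node: [neighbors]} (undirected).
    Returns a stable color id per node after the given number of rounds."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        # Signature = own color + sorted multiset of neighbor colors.
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        # Relabel signatures with small canonical integer ids.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

# A path on 3 nodes: both degree-1 endpoints share a color, the middle differs.
path = {0: [1], 1: [0, 2], 2: [1]}
c = wl_refine(path)
```

The abstract's claim that 1-WL alone suffices for exact counts of induced subgraphs up to size 4 uses refinements of exactly these color classes as the counting primitive.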

[343] Generalizing Graph Transformers Across Diverse Graphs and Tasks via Pre-training

Yufei He, Zhenyu Hou, Yukuo Cen, Jun Hu, Feng He, Xu Cheng, Jie Tang, Bryan Hooi

Main category: cs.LG

TL;DR: PGT is a scalable transformer-based graph pre-training framework that uses masked autoencoder architecture with feature and structure reconstruction tasks, achieving SOTA performance on large graphs with over 540M nodes.

DetailsMotivation: Extending graph pre-trained models to web-scale graphs with billions of nodes while avoiding negative transfer and enabling inductive predictions for unseen nodes/graphs.

Method: Masked autoencoder architecture with two pre-training tasks: node feature reconstruction and local structure reconstruction, plus decoder-based feature augmentation strategy.

Result: Achieved state-of-the-art performance on ogbn-papers100M (111M nodes, 1.6B edges) and successfully deployed on Tencent’s game data (540M nodes, 12B edges) with effective generalization across static and dynamic tasks.

Conclusion: PGT demonstrates scalability, efficiency, and generalization capability for large-scale graph pre-training in industrial scenarios.

Abstract: Graph pre-training has been concentrated on graph-level tasks involving small graphs (e.g., molecular graphs) or learning node representations on a fixed graph. Extending graph pre-trained models to web-scale graphs with billions of nodes in industrial scenarios, while avoiding negative transfer across graphs or tasks, remains a challenge. We aim to develop a general graph pre-trained model with inductive ability that can make predictions for unseen new nodes and even new graphs. In this work, we introduce a scalable transformer-based graph pre-training framework called PGT (Pre-trained Graph Transformer). Based on the masked autoencoder architecture, we design two pre-training tasks: one for reconstructing node features and the other for reconstructing local structures. Unlike the original autoencoder architecture where the pre-trained decoder is discarded, we propose a novel strategy that utilizes the decoder for feature augmentation. Our framework, tested on the publicly available ogbn-papers100M dataset with 111 million nodes and 1.6 billion edges, achieves state-of-the-art performance, showcasing scalability and efficiency. We have deployed our framework on Tencent’s online game data, confirming its capability to pre-train on real-world graphs with over 540 million nodes and 12 billion edges and to generalize effectively across diverse static and dynamic downstream tasks.

[344] FedQUIT: On-Device Federated Unlearning via a Quasi-Competent Virtual Teacher

Alessio Mora, Lorenzo Valerio, Paolo Bellavista, Andrea Passarella

Main category: cs.LG

TL;DR: FedQUIT enables efficient data removal in Federated Learning using knowledge distillation, allowing clients to be forgotten while maintaining model performance without requiring historical updates or proxy datasets.

DetailsMotivation: FL participants should have the right to be forgotten, ensuring their data contributions can be removed from models upon request, which is challenging in distributed settings without centralized data storage.

Method: Uses knowledge distillation with FL global model as teacher and local model as student. Tailors teacher’s output on local data by penalizing true class prediction scores to induce forgetting.

Result: Outperforms state-of-the-art in forgetting data, has same computational requirements as FedAvg round, reduces communication costs by up to 117.6x compared to retraining from scratch.

Conclusion: FedQUIT provides an efficient and practical solution for data removal in FL that preserves model generalization while respecting user privacy rights.

Abstract: Federated Learning (FL) systems enable the collaborative training of machine learning models without requiring centralized collection of individual data. FL participants should have the ability to exercise their right to be forgotten, ensuring their past contributions can be removed from the learned model upon request. In this paper, we propose FedQUIT, a novel algorithm that uses knowledge distillation to scrub the contribution of the data to forget from an FL global model while preserving its generalization ability. FedQUIT directly works on client devices that request to leave the federation, and leverages a teacher-student framework. The FL global model acts as the teacher, and the local model works as the student. To induce forgetting, FedQUIT tailors the teacher’s output on local data (the data to forget) penalizing the prediction score of the true class. Unlike previous work, our method does not rely on assumptions that are hardly viable in cross-device settings, such as storing historical updates of participants or requiring access to proxy datasets. Experimental results on various datasets and model architectures demonstrate that (i) FedQUIT outperforms state-of-the-art competitors in forgetting data, (ii) has the same computational requirements as a regular FedAvg round, and (iii) reduces the cumulative communication costs by up to 117.6$\times$ compared to retraining from scratch to restore the initial generalization performance after unlearning.
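A hedged sketch of FedQUIT's forgetting signal as the abstract describes it: the global model's (teacher's) logits on the data to forget are modified by penalizing the true-class score before they are used as the distillation target for the local student. The penalty form and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def quasi_competent_teacher(logits, true_class, penalty=10.0):
    """Suppress the true class in the teacher's logits before producing
    the distillation target, so the student is pulled away from the
    prediction it learned on the data to forget (illustrative penalty)."""
    out = list(logits)
    out[true_class] -= penalty
    return softmax(out)

teacher_logits = [4.0, 1.0, 0.5]                 # teacher confidently predicts class 0
target = quasi_competent_teacher(teacher_logits, true_class=0)
# The distillation target now assigns class 0 negligible probability,
# while the relative ordering of the remaining classes is preserved.
```

Distilling the student toward `target` (e.g. with a KL loss) would induce forgetting of class 0 on this example while keeping the teacher's knowledge about the other classes, which is the generalization-preserving property the abstract emphasizes.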

[345] Integrating Sequential and Relational Modeling for User Events: Datasets and Prediction Tasks

Rizal Fathony, Igor Melnyk, Owen Reinert, Nam H. Nguyen, Daniele Rosa, C. Bayan Bruss

Main category: cs.LG

TL;DR: This paper introduces datasets and methods for unified modeling of both personal and relational user events, showing that combining both event types improves performance over single-type approaches.

DetailsMotivation: User events are typically modeled separately as sequences (personal events) or graphs (relational events), but real-world systems need to capture both types together, which hasn't been adequately addressed in prior work.

Method: The authors introduce public datasets that incorporate both personal and relational events, propose a unified formalization, and empirically evaluate models that incorporate both event types.

Result: Models benefit from incorporating both personal and relational events, though current methods still leave significant room for improvement.

Conclusion: The resources released support further research in unified user event modeling, encouraging progress in combining both event types for better user behavior modeling.

Abstract: User event modeling plays a central role in many machine learning applications, with use cases spanning e-commerce, social media, finance, cybersecurity, and other domains. User events can be broadly categorized into personal events, which involve individual actions, and relational events, which involve interactions between two users. These two types of events are typically modeled separately, using sequence-based methods for personal events and graph-based methods for relational events. Despite the need to capture both event types in real-world systems, prior work has rarely considered them together. This is often due to the convenient simplification that user behavior can be adequately represented by a single formalization, either as a sequence or a graph. To address this gap, there is a need for public datasets and prediction tasks that explicitly incorporate both personal and relational events. In this work, we introduce a collection of such datasets, propose a unified formalization, and empirically show that models benefit from incorporating both event types. Our results also indicate that current methods leave a notable room for improvements. We release these resources to support further research in unified user event modeling and encourage progress in this direction.

[346] Diffusion & Adversarial Schrödinger Bridges via Iterative Proportional Markovian Fitting

Sergei Kholkin, Grigoriy Ksenofontov, David Li, Nikita Kornilov, Nikita Gushchin, Alexandra Suvorikova, Alexey Kroshnin, Evgeny Burnaev, Alexander Korotin

Main category: cs.LG

TL;DR: The paper introduces Iterative Proportional Markovian Fitting (IPMF), which combines Iterative Markovian Fitting (IMF) and Iterative Proportional Fitting (IPF) to solve Schrödinger Bridge problems, offering improved stability and flexible trade-offs between image similarity and generation quality.

DetailsMotivation: To address the instability of the original IMF procedure for solving Schrödinger Bridge problems and to develop a unified framework that integrates IMF and IPF methods for more reliable performance in applications like unpaired domain translation.

Method: Proposes IPMF procedure that alternates between fitting forward and backward time diffusion at each iteration, effectively combining IMF and IPF approaches to stabilize training and improve convergence.

Result: Establishes theoretical and empirical convergence of IPMF under various settings, demonstrating its effectiveness in providing flexible trade-offs between image similarity and generation quality for practical applications.

Conclusion: IPMF provides a unified framework for solving Schrödinger Bridge problems with improved stability and practical flexibility, enabling tailored models for specific tasks through controlled trade-offs between different quality metrics.

Abstract: The Iterative Markovian Fitting (IMF) procedure, which iteratively projects onto the space of Markov processes and the reciprocal class, successfully solves the Schrödinger Bridge (SB) problem. However, an efficient practical implementation requires a heuristic modification – alternating between fitting forward and backward time diffusion at each iteration. This modification is crucial for stabilizing training and achieving reliable results in applications such as unpaired domain translation. Our work reveals a close connection between the modified version of IMF and the Iterative Proportional Fitting (IPF) procedure – a foundational method for the SB problem, also known as Sinkhorn’s algorithm. Specifically, we demonstrate that the heuristic modification of the IMF effectively integrates both IMF and IPF procedures. We refer to this combined approach as the Iterative Proportional Markovian Fitting (IPMF) procedure. Through theoretical and empirical analysis, we establish the convergence of the IPMF procedure under various settings, contributing to developing a unified framework for solving SB problems. Moreover, from a practical standpoint, the IPMF procedure enables a flexible trade-off between image similarity and generation quality, offering a new mechanism for tailoring models to specific tasks.
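The abstract identifies IPF with Sinkhorn's algorithm, so the discrete case is easy to illustrate: alternately rescale the rows and columns of a Gibbs kernel until the coupling matches both target marginals. This standard toy version is only an analogue of the alternating forward/backward fits in IPMF, not the paper's diffusion-based procedure.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT via Sinkhorn / IPF: find P = diag(u) K diag(v) with
    row sums a and column sums b, where K = exp(-cost / eps)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate the two marginal projections (the IPF half-steps).
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Uniform marginals, cheap diagonal: mass concentrates on the diagonal.
P = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

Each half-step fixes one marginal exactly while perturbing the other; convergence of the alternation is the discrete counterpart of the IPMF convergence the paper establishes in continuous settings.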

[347] Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning

Dongkwan Lee, Junhoo Lee, Nojun Kwak

Main category: cs.LG

TL;DR: Deep Edge Filter applies high-pass filtering to neural network features to improve generalization by isolating high-frequency semantic components while removing low-frequency domain biases.

Motivation: The hypothesis that neural networks encode task-relevant semantic information in high-frequency components and domain-specific biases in low-frequency components of deep features.

Method: Subtracting low-pass filtered outputs from original features to isolate generalizable representations while preserving architectural integrity.

Result: Consistent performance improvements across diverse domains (Vision, Text, 3D, Audio) regardless of model architecture and data modality. Analysis shows feature sparsification and effective isolation of high-frequency components.

Conclusion: The method empirically validates the core hypothesis and provides a generalizable approach for improving model performance across various domains and architectures.

Abstract: We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.
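The core operation, subtracting a low-pass filtered copy of the features from the originals, can be sketched in a few lines; the box filter here is an illustrative stand-in for whatever low-pass operator the paper actually uses:

```python
import numpy as np

def low_pass(feat, k=3):
    """Box filter (moving average) along the feature axis: a simple low-pass."""
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad)), mode="edge")
    kernel = np.ones(k) / k
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])

def deep_edge_filter(feat, k=3):
    """High-pass: subtract the low-pass filtered features from the originals."""
    return feat - low_pass(feat, k)

feat = np.array([[1.0, 1.0, 5.0, 1.0, 1.0]])  # one feature row with a spike
hp = deep_edge_filter(feat)
```

Smooth (low-frequency) regions cancel to roughly zero, while the spike, the high-frequency component, survives.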

[348] Small Singular Values Matter: A Random Matrix Analysis of Transformer Models

Max Staats, Matthias Thamm, Bernd Rosenow

Main category: cs.LG

TL;DR: Analysis of transformer weight matrices reveals that both large AND small singular values contain learned information, challenging conventional wisdom that only large singular values matter for model performance.

Motivation: To understand how information is stored across the entire singular-value spectrum in transformer models, particularly investigating whether small singular values contain meaningful learned information rather than just noise.

Method: Used Random Matrix Theory as a baseline for randomness, analyzed deviations in both large and small singular values, compared singular vectors with activation covariance eigenvectors, and tested impact through selective zeroing experiments.

Result: Found significant deviations from RMT in both large AND small singular values. Small singular values capture important data directions and their removal substantially increases perplexity. After fine-tuning, the smallest decile becomes the third most influential part of the spectrum.

Conclusion: Small singular values contain crucial learned information, challenging the conventional focus on large singular values only. This provides new theoretical understanding and practical guidance for SVD-based model compression and pruning.

Abstract: This work analyzes singular-value spectra of weight matrices in pretrained transformer models to understand how information is stored at both ends of the spectrum. Using Random Matrix Theory (RMT) as a zero information hypothesis, we associate agreement with RMT as evidence of randomness and deviations as evidence for learning. Surprisingly, we observe pronounced departures from RMT not only among the largest singular values – the usual outliers – but also among the smallest ones. A comparison of the associated singular vectors with the eigenvectors of the activation covariance matrices shows that there is considerable overlap wherever RMT is violated. Thus, significant directions in the data are captured by small singular values and their vectors as well as by the large ones. We confirm this empirically: zeroing out the singular values that deviate from RMT raises language-model perplexity far more than removing values from the bulk, and after fine-tuning the smallest decile can be the third most influential part of the spectrum. To explain how vectors linked to small singular values can carry more information than those linked to larger values, we propose a linear random-matrix model. Our findings highlight the overlooked importance of the low end of the spectrum and provide theoretical and practical guidance for SVD-based pruning and compression of large language models.
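The selective-zeroing experiment can be sketched on a random weight matrix; measuring language-model perplexity is model-specific, so this toy only shows the mechanics of removing a chosen slice of the spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))

U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s is sorted descending

def zero_singular_values(U, s, Vt, idx):
    """Reconstruct W with the singular values at `idx` set to zero."""
    s2 = s.copy()
    s2[idx] = 0.0
    return (U * s2) @ Vt

# zero the smallest decile vs. an equally sized slice from the bulk
n = len(s)
small = np.arange(int(0.9 * n), n)
bulk = np.arange(int(0.45 * n), int(0.55 * n))
W_small = zero_singular_values(U, s, Vt, small)
W_bulk = zero_singular_values(U, s, Vt, bulk)
```

In Frobenius norm the bulk slice changes the matrix more; the paper's point is that the functional effect on perplexity does not follow this naive norm-based ordering.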

[349] Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

Chenwei Tang, Jingyu Xing, Xinyu Liu, Wei Ju, Jiancheng Lv, Fan Zhang, Deng Xiong, Ziyue Qiao

Main category: cs.LG

TL;DR: COMPASS introduces a novel test-time reward mechanism for reinforcement learning on unlabeled data, using dual-calibration answer rewards and decisive path rewards to enhance LLM reasoning capabilities without external supervision.

Motivation: Current RL methods for LLMs face scalability bottlenecks due to reliance on human-curated preference data or labeled datasets for reward modeling. The paper aims to overcome this by enabling RL on unlabeled data through autonomous learning from continuous experience streams.

Method: COMPASS integrates two components: 1) Dual-Calibration Answer Reward (DCAR) - stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and 2) Decisive Path Reward (DPR) - directly optimizes reasoning process quality beyond outcome supervision. The method jointly reinforces trustworthy consensus answers and highly decisive reasoning chains.

Result: Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, demonstrating effectiveness in enhancing LLM analytical capabilities.

Conclusion: COMPASS advances a more scalable direction for LLMs to learn from continuous experience by providing reliable reward estimation without ground-truth supervision, overcoming fundamental scalability bottlenecks in current RL methods.

Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches like Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes the reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, COMPASS systematically enhances the model’s analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.
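A minimal sketch of consensus pseudo-labeling with a confidence gate, in the spirit of DCAR; the vote-share threshold is a hypothetical stand-in for the paper's confidence and credibility calibration:

```python
from collections import Counter

def consensus_pseudo_label(answers, min_confidence=0.6):
    """Majority-vote pseudo-label, kept only if the vote share (a crude
    confidence proxy) clears a threshold; otherwise no reward signal."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(answers)
    if confidence >= min_confidence:
        return answer, confidence
    return None, confidence  # too uncertain: withhold the pseudo-label

label, conf = consensus_pseudo_label(["42", "42", "42", "7", "42"])
```

Gating on confidence is what keeps an incorrect majority from being blindly reinforced, the failure mode the abstract attributes to plain test-time RL.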

[350] Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models

Wenda Li, Huijie Zhang, Qing Qu

Main category: cs.LG

TL;DR: Shallow Diffuse is a new watermarking technique for diffusion models that embeds robust, invisible watermarks by leveraging a low-dimensional subspace in image generation, decoupling watermarking from the diffusion process for better consistency and detectability.

Motivation: The widespread use of AI-generated content from diffusion models raises concerns about misinformation and copyright infringement, making watermarking essential for identifying and preventing misuse of AI-generated images.

Method: Shallow Diffuse embeds watermarks by leveraging a low-dimensional subspace in image generation, placing most of the watermark in the null space of this subspace to decouple watermarking from the diffusion sampling process.

Result: Extensive experiments show Shallow Diffuse outperforms existing watermarking methods in robustness and consistency, with theoretical and empirical analyses confirming enhanced data generation consistency and watermark detectability.

Conclusion: The decoupling strategy in Shallow Diffuse effectively separates watermarking from image generation, providing superior watermarking performance for diffusion models while maintaining image quality.

Abstract: The widespread use of AI-generated content from diffusion models has raised significant concerns regarding misinformation and copyright infringement. Watermarking is a crucial technique for identifying these AI-generated images and preventing their misuse. In this paper, we introduce Shallow Diffuse, a new watermarking technique that embeds robust and invisible watermarks into diffusion model outputs. Unlike existing approaches that integrate watermarking throughout the entire diffusion sampling process, Shallow Diffuse decouples these steps by leveraging the presence of a low-dimensional subspace in the image generation process. This method ensures that a substantial portion of the watermark lies in the null space of this subspace, effectively separating it from the image generation process. Our theoretical and empirical analyses show that this decoupling strategy greatly enhances the consistency of data generation and the detectability of the watermark. Extensive experiments further validate that Shallow Diffuse outperforms existing watermarking methods in terms of robustness and consistency. The code is released at https://github.com/liwd190019/Shallow-Diffuse.
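The null-space idea can be illustrated directly: given an orthonormal basis Q of the low-dimensional subspace, the watermark component that avoids interfering with generation is the orthogonal-complement projection. This is a generic linear-algebra sketch, not the paper's diffusion-specific construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4
B = rng.standard_normal((d, k))   # basis of a low-dimensional "generation" subspace
Q, _ = np.linalg.qr(B)            # orthonormalize it

def project_to_null_space(w, Q):
    """Keep only the watermark component orthogonal to the span of Q."""
    return w - Q @ (Q.T @ w)

w = rng.standard_normal(d)        # candidate watermark
w_null = project_to_null_space(w, Q)
```

After projection, the watermark has zero component along the subspace, so (in this toy model) it cannot perturb directions the generator cares about.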

[351] Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream

Abdulkadir Gokce, Martin Schrimpf

Main category: cs.LG

TL;DR: Scaling model size improves behavioral alignment with primate object recognition but neural alignment saturates, suggesting current scaling approaches won’t yield better brain models.

Motivation: To understand how scaling compute, model size, and dataset size affects brain alignment in neural network models of primate visual processing.

Method: Systematically evaluated over 600 models trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and behavior across different architectures and datasets.

Result: Behavioral alignment continues to scale with larger models, but neural alignment saturates. This pattern holds across architectures and datasets, with scaling being especially beneficial for higher-level visual areas.

Conclusion: Scaling current architectures and datasets suffices for behavioral alignment but won’t improve models of the brain’s visual ventral stream, highlighting the need for novel brain modeling strategies.

Abstract: When trained on large-scale object classification datasets, certain artificial neural network models begin to approximate core object recognition behaviors and neural response patterns in the primate brain. While recent machine learning advances suggest that scaling compute, model size, and dataset size improves task performance, the impact of scaling on brain alignment remains unclear. In this study, we explore scaling laws for modeling the primate visual ventral stream by systematically evaluating over 600 models trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and behavior. We find that while behavioral alignment continues to scale with larger models, neural alignment saturates. This observation remains true across model architectures and training datasets, even though models with stronger inductive biases and datasets with higher-quality images are more compute-efficient. Increased scaling is especially beneficial for higher-level visual areas, where small models trained on few samples exhibit only poor alignment. Our results suggest that while scaling current architectures and datasets might suffice for alignment with human core object recognition behavior, it will not yield improved models of the brain’s visual ventral stream, highlighting the need for novel strategies in building brain models.

[352] ADPO: Anchored Direct Preference Optimization

Wang Zixian

Main category: cs.LG

TL;DR: ADPO extends DPO with soft listwise supervision via reference anchoring, handling noise and distribution shift better while unifying several learning approaches as special cases.

Motivation: DPO is brittle under annotator noise and distribution shift due to hard pairwise labels and limited regularization of log-probability differences.

Method: ADPO minimizes KL divergence between target distribution and softmax of score differences with anchor policy, supporting both fixed and dynamic anchors.

Result: ADPO achieves up to 170-5000x reduction in student-teacher KL divergence, with dynamic anchors improving online exploration under noise and fixed anchors excelling at offline distillation.

Conclusion: ADPO provides a flexible framework that unifies multiple learning approaches and offers significant improvements over DPO, with task-dependent anchor strategies.

Abstract: Direct Preference Optimization (DPO) is effective but brittle under annotator noise and distribution shift because it operates on hard, pairwise labels and only regularizes log-probability differences. We introduce Anchored Direct Preference Optimization (ADPO), a framework that extends preference learning to soft listwise supervision via reference anchoring. ADPO minimizes KL(q || softmax((s - s_ref) / tau_anc)), which (i) recovers supervised fine-tuning, knowledge distillation, maximum-entropy reinforcement learning, and DPO as special cases through suitable choices of target q, anchor policy, and temperature; (ii) induces an implicit trust region governed by the softmax Fisher metric, independent of the anchor; and (iii) supports stable dynamic-anchor updates. Empirically, we observe a task-dependent tradeoff: dynamic anchors improve online exploration under noise, while fixed anchors excel at offline distillation, achieving up to 170 to 5000 times reduction in student-teacher KL on our benchmarks.
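The stated objective, KL(q || softmax((s - s_ref) / tau_anc)), is concrete enough to sketch directly; the toy scores below are illustrative:

```python
import numpy as np

def adpo_loss(q, s, s_ref, tau=1.0):
    """KL(q || softmax((s - s_ref) / tau)): anchored listwise preference loss."""
    z = (s - s_ref) / tau
    z = z - z.max()                      # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    q = np.asarray(q, dtype=float)
    mask = q > 0                         # 0 * log 0 = 0 convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = np.array([0.7, 0.2, 0.1])            # soft listwise target
s = np.array([2.0, 1.0, 0.5])            # policy scores
s_ref = np.array([0.5, 0.5, 0.5])        # anchor (reference) scores
loss = adpo_loss(q, s, s_ref)
```

Choosing a hard one-hot q recovers a cross-entropy-style objective, while a teacher's output distribution as q gives the distillation special case the abstract mentions.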

[353] scMEDAL for the interpretable analysis of single-cell transcriptomics data with batch effect visualization using a deep mixed effects autoencoder

Aixa X. Andrade, Son Nguyen, Austin Marckx, Albert Montillo

Main category: cs.LG

TL;DR: scMEDAL is a novel single-cell RNA sequencing batch correction framework that separates batch-invariant and batch-specific effects using two complementary subnetworks: scMEDAL-RE (random-effects Bayesian autoencoder) and scMEDAL-FE (fixed-effects subnetwork).

Motivation: Existing batch-correction algorithms suppress or discard batch-related variation rather than modeling it, often losing biologically meaningful information confounded with batch effects.

Method: Uses two subnetworks: scMEDAL-RE (random-effects Bayesian autoencoder) learns batch-specific representations while preserving biological information, and scMEDAL-FE (fixed-effects subnetwork) trained via adversarial learning provides default batch correction.

Result: scMEDAL-RE produces interpretable, batch-specific embeddings that complement existing correction methods, yielding more accurate prediction of disease status, donor group, and tissue across diverse conditions (autism, leukemia, cardiovascular).

Conclusion: scMEDAL is a versatile, interpretable framework that complements existing correction methods, providing enhanced insight into cellular heterogeneity and data acquisition through generative visualizations and counterfactual reconstructions.

Abstract: Single-cell RNA sequencing enables high-resolution analysis of cellular heterogeneity, yet disentangling biological signal from batch effects remains a major challenge. Existing batch-correction algorithms suppress or discard batch-related variation rather than modeling it. We propose scMEDAL, single-cell Mixed Effects Deep Autoencoder Learning, a framework that separately models batch-invariant and batch-specific effects using two complementary subnetworks. The principal innovation, scMEDAL-RE, is a random-effects Bayesian autoencoder that learns batch-specific representations while preserving biologically meaningful information confounded with batch effects, signal that is often lost under standard correction. Complementing it, the fixed-effects subnetwork, scMEDAL-FE, trained via adversarial learning, provides a default batch-correction component. Evaluations across diverse conditions (autism, leukemia, cardiovascular), cell types, and technical and biological effects show that scMEDAL-RE produces interpretable, batch-specific embeddings that complement both scMEDAL-FE and established correction methods (scVI, Scanorama, Harmony, SAUCIE), yielding more accurate prediction of disease status, donor group, and tissue. scMEDAL also provides generative visualizations, including counterfactual reconstructions of a cell’s expression as if acquired in another batch. The framework allows substitution of the fixed-effects component with other correction methods, while retaining scMEDAL-RE’s enhanced predictive power and visualization. Overall, scMEDAL is a versatile, interpretable framework that complements existing correction methods, providing enhanced insight into cellular heterogeneity and data acquisition.

[354] TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins

Main category: cs.LG

TL;DR: TowerVision is a family of open multilingual vision-language models that achieves competitive performance on multimodal multilingual benchmarks, particularly excelling in culturally grounded tasks and multimodal translation.

Motivation: Most existing vision-language models follow an English-centric design process, limiting their effectiveness in multilingual settings. This work aims to address this limitation by creating multilingual VLMs.

Method: Comprehensive empirical study analyzing multilingual design choices including training data composition, encoder selection, and text backbones. Built upon the multilingual text-only model Tower+ and fine-tuned with visual and cultural context.

Result: TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks, surpassing existing approaches trained on larger datasets. Shows particular strength in culturally grounded tasks and multimodal translation on ALM-Bench, Multi30K (image tasks) and ViMUL-Bench (video tasks).

Conclusion: Multilingual vision-language training data substantially improves cross-lingual generalization, and instruction-tuned LLMs are not always the optimal initialization point. The models, data, and training recipes are publicly released to support further research.

Abstract: Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization – both from high-resource to underrepresented languages and vice versa – and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.

[355] AnomalyAID: Reliable Interpretation for Semi-supervised Network Anomaly Detection

Yachao Yuan, Yu Huang, Yingwen Wu, Jin Wang

Main category: cs.LG

TL;DR: AnomalyAID is a semi-supervised framework for network anomaly detection that provides interpretable explanations and improves detection performance using pseudo-labeling with limited labeled data.

Motivation: Semi-supervised learning is crucial for network anomaly detection but faces challenges with limited labeled samples and lack of interpretability, creating barriers to practical adoption.

Method: Proposes a novel interpretation approach using global and local interpreters for reliable explanations, and a two-stage semi-supervised learning framework with prediction alignment constraints for pseudo-labeling.

Result: Experimental results show AnomalyAID provides accurate detection results with reliable interpretations for semi-supervised network anomaly detection systems.

Conclusion: AnomalyAID successfully addresses interpretability and performance challenges in semi-supervised network anomaly detection, demonstrating effectiveness across multiple detection tasks.

Abstract: Semi-supervised learning plays a crucial role in network anomaly detection applications; however, learning anomaly patterns with limited labeled samples is not easy. Additionally, the lack of interpretability creates key barriers to the adoption of semi-supervised frameworks in practice. Most existing interpretation methods are developed for supervised/unsupervised frameworks or non-security domains and fail to provide reliable interpretations. In this paper, we propose AnomalyAID, a general framework aiming to (1) make the anomaly detection process interpretable and improve the reliability of interpretation results, and (2) assign high-confidence pseudo labels to unlabeled samples for improving the performance of anomaly detection systems with limited supervised data. For (1), we propose a novel interpretation approach that leverages global and local interpreters to provide reliable explanations, while for (2), we design a new two-stage semi-supervised learning framework for network anomaly detection by aligning both stages’ model predictions with special constraints. We apply AnomalyAID over two representative network anomaly detection tasks and extensively evaluate AnomalyAID with representative prior works. Experimental results demonstrate that AnomalyAID can provide accurate detection results with reliable interpretations for semi-supervised network anomaly detection systems. The code is available at: https://github.com/M-Code-Space/AnomalyAID.
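A toy version of aligning two stages' predictions before assigning high-confidence pseudo-labels; the agreement-plus-threshold rule below is an assumption for illustration, not the paper's exact constraint:

```python
def agree_pseudo_labels(probs_a, probs_b, threshold=0.9):
    """Keep a pseudo-label only when both stages' models predict the same
    class with probability above the threshold; otherwise leave unlabeled."""
    labels = []
    for pa, pb in zip(probs_a, probs_b):
        ca = max(range(len(pa)), key=pa.__getitem__)  # stage-1 argmax
        cb = max(range(len(pb)), key=pb.__getitem__)  # stage-2 argmax
        if ca == cb and pa[ca] >= threshold and pb[cb] >= threshold:
            labels.append(ca)
        else:
            labels.append(None)   # disagreement or low confidence
    return labels

labels = agree_pseudo_labels(
    [[0.95, 0.05], [0.55, 0.45]],   # stage-1 class probabilities
    [[0.92, 0.08], [0.10, 0.90]],   # stage-2 class probabilities
)
```

Only samples where both models agree confidently feed back into training, which is the general spirit of high-confidence pseudo-labeling under limited supervision.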

[356] Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off

Muhammad Faraz Ul Abrar, Nicolò Michelusi

Main category: cs.LG

TL;DR: This paper proposes a novel over-the-air federated learning approach that allows structured model bias under heterogeneous wireless conditions, optimizing the bias-variance trade-off for non-convex objectives using successive convex approximation.

Motivation: Existing OTA-FL designs enforce zero-bias updates under homogeneous wireless assumptions, which constrains performance under heterogeneous conditions and inflates update variance. Prior biased OTA-FL analyses focus on convex objectives, while modern AI models are non-convex.

Method: Developed OTA-FL SGD updates that allow structured, time-invariant model bias while reducing variance. Proposed a non-convex joint OTA power-control design optimized using successive convex approximation (SCA) algorithm requiring only statistical CSI.

Result: The approach achieves a finite-time stationarity bound revealing bias-variance trade-off. Experiments on non-convex image classification show accelerated convergence via optimized bias and improved generalization over prior OTA-FL baselines.

Conclusion: The proposed OTA-FL framework with structured bias and optimized power control effectively handles wireless heterogeneity and non-convex objectives, outperforming existing approaches in convergence speed and generalization performance.

Abstract: Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming homogeneous wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under heterogeneous wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
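A stylized model of over-the-air aggregation shows where the bias comes from: devices' scaled gradients superpose on the channel, and unequal effective weights under heterogeneous path loss skew the normalized estimate. All variable names and the power-control choice here are illustrative, not the paper's design:

```python
import numpy as np

rng = np.random.default_rng(0)

def ota_aggregate(grads, gains, powers, noise_std=0.1):
    """Over-the-air aggregation: scaled gradients superpose on the channel and
    the receiver normalizes the noisy sum. Unequal effective weights
    (gain * power) introduce a structured bias under heterogeneity."""
    d = grads.shape[1]
    weights = gains * powers
    rx = (weights[:, None] * grads).sum(axis=0) + noise_std * rng.standard_normal(d)
    return rx / weights.sum()

grads = np.array([[1.0, 0.0], [0.0, 1.0]])   # two devices' local gradients
gains = np.array([1.0, 0.2])                 # heterogeneous path loss
powers = np.array([1.0, 1.0])                # naive equal transmit power
est = ota_aggregate(grads, gains, powers, noise_std=0.0)
```

Even with zero channel noise, the estimate is pulled toward the strong device; forcing the weights equal (zero bias) would instead require the weak device's power to dominate, inflating variance. That tension is the bias-variance trade-off the paper optimizes.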

[357] Revisiting Federated Fine-Tuning: A Single Communication Round is Enough for Foundation Models

Ziyao Wang, Bowei Tian, Yexiao He, Zheyu Shen, Guoheng Sun, Yuhan Liu, Luyang Liu, Meng Liu, Ang Li

Main category: cs.LG

TL;DR: One-shot federated fine-tuning achieves comparable performance to multi-round approaches while significantly reducing communication costs for large foundation models.

Motivation: Traditional federated fine-tuning of foundation models suffers from prohibitively high communication costs due to large parameter sizes and multi-round communication requirements.

Method: Proposes and analyzes one-shot federated fine-tuning (single round aggregation) for large foundation models, with theoretical and empirical validation comparing it to multi-round approaches.

Result: One-shot federated fine-tuning achieves global model performance comparable to multi-round aggregation while significantly reducing communication costs, enabling asynchronous aggregation, and enhancing privacy.

Conclusion: One-shot federated fine-tuning revolutionizes federated fine-tuning by enhancing efficiency, reducing costs, and expanding accessibility for foundation models while maintaining performance consistency.

Abstract: The recent advancement of foundation models (FMs) has increased the demand for fine-tuning these models on large-scale cross-domain datasets. To address this, federated fine-tuning has emerged, allowing FMs to be fine-tuned on distributed datasets across multiple devices while ensuring data privacy. However, the substantial parameter size and the multi-round communication in federated learning algorithms result in prohibitively high communication costs, challenging the practicality of federated fine-tuning. In this paper, we identify and analyze, both theoretically and empirically, that the traditional multi-round aggregation algorithms may not be necessary for federated fine-tuning large FMs. Our experiments reveal that a single round of aggregation (i.e., one-shot federated fine-tuning) yields a global model performance comparable to that achieved through multiple rounds of aggregation. Through rigorous mathematical and empirical analyses, we demonstrate that large FMs, due to their extensive parameter sizes and pre-training on general tasks, achieve significantly lower training loss in one-shot federated fine-tuning compared to smaller models. Our extensive experiments show that one-shot federated fine-tuning significantly reduces communication costs. It also has the potential to enable asynchronous aggregation, enhances privacy, and maintains performance consistency with multi-round federated fine-tuning on both text generation and text-to-image generation tasks. Our findings provide insights to revolutionize federated fine-tuning in practice, enhancing efficiency, reducing costs, and expanding accessibility for FMs.
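One-shot aggregation itself is just a single dataset-size-weighted average of the client models, sketched here for generic weight dictionaries:

```python
import numpy as np

def one_shot_federated_finetune(client_weights, client_sizes):
    """Single-round aggregation: a dataset-size-weighted average of the
    clients' fine-tuned parameters, with no further communication rounds."""
    total = sum(client_sizes)
    agg = {}
    for key in client_weights[0]:
        agg[key] = sum(w[key] * (n / total)
                       for w, n in zip(client_weights, client_sizes))
    return agg

# two clients with toy one-parameter-tensor models
clients = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
sizes = [100, 300]
global_model = one_shot_federated_finetune(clients, sizes)
```

Because each client uploads exactly once, clients can also report asynchronously, which is the asynchronous-aggregation benefit the abstract notes.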

[358] A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

William Merrill, Ashish Sabharwal

Main category: cs.LG

TL;DR: Transformers with depth growing logarithmically with input length can solve regular language recognition and graph connectivity problems, which fixed-depth transformers cannot express under standard complexity conjectures.

Motivation: To understand how transformer depth affects expressive power for sequential reasoning, particularly whether bounded depth suffices for short inputs and how increasing depth improves expressivity.

Method: Theoretical analysis of transformers with depth scaling as Θ(log n) with input length n, examining their ability to express regular languages and graph connectivity problems.

Result: Transformers with Θ(log n) depth can express regular language recognition and graph connectivity, while fixed-depth transformers cannot. Empirical experiments show theoretical depth requirements closely match practical training needs.

Conclusion: Growing transformer depth with input length significantly enhances reasoning capabilities, with logarithmic scaling being more efficient than width scaling or chain-of-thought steps for sequential reasoning tasks.

Abstract: Recent theoretical results show transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer’s depth affects its expressive power. We address these questions by analyzing transformers whose depth can grow minimally with context length $n$. We show even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems: recognizing regular languages, which captures state tracking abilities and was known to be expressible only by an unconventional, non-uniform model of transformers, and graph connectivity, which underlies multi-step reasoning. Notably, both of these problems cannot be expressed by fixed-depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain-of-thought steps. Empirically, our detailed experiments designed to bridge the expressivity vs. learnability gap reveal that our theoretical depth requirements for regular language recognition closely match the practical depth requirements for successfully training transformers. Thus, our results clarify how depth affects a transformer’s reasoning capabilities, and provide practical guidance for effective depth selection for sequential reasoning.
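The graph-connectivity result has a classical intuition: repeated squaring of the reachability matrix doubles the covered path length per round, so about log2(n) rounds suffice. In this sketch, matrix squaring stands in for what successive transformer layers would have to compute:

```python
import numpy as np

def reachability(adj):
    """Transitive closure by repeated squaring: each of the ~log2(n) rounds
    doubles the path length covered, mirroring how logarithmic depth
    suffices for graph connectivity."""
    n = len(adj)
    reach = ((adj + np.eye(n, dtype=int)) > 0).astype(int)  # paths of length <= 1
    rounds = 0
    covered = 1
    while covered < n:
        reach = ((reach @ reach) > 0).astype(int)  # one squaring "layer"
        covered *= 2
        rounds += 1
    return reach, rounds

# path graph 0-1-2-3: every pair is connected, diameter 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
reach, rounds = reachability(adj)
```

A fixed number of rounds, by contrast, caps the reachable path length at a constant power of two, which is the intuition behind fixed-depth transformers failing on long inputs.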

[359] Explanations Go Linear: Interpretable and Individual Latent Encoding for Post-hoc Explainability

Simone Piaggesi, Riccardo Guidotti, Fosca Giannotti, Dino Pedreschi

Main category: cs.LG

TL;DR: ILLUME is a flexible framework that combines global surrogates with instance-specific linear transformations to provide both local and global explanations for black-box classifiers, addressing limitations of traditional surrogate methods.

DetailsMotivation: Traditional surrogate-based explainability methods have significant limitations - local surrogates are computationally expensive and parameter-sensitive, while global surrogates struggle with complex local behaviors.

Method: Combines a globally trained surrogate with instance-specific linear transformations learned using a meta-encoder to generate both local and global explanations.

Result: Extensive empirical evaluations show ILLUME produces feature attributions and decision rules that are accurate, robust, and faithful to the black-box model.

Conclusion: ILLUME provides a unified explanation framework that effectively addresses the limitations of traditional surrogate methods for black-box model interpretability.

Abstract: Post-hoc explainability is essential for understanding black-box machine learning models. Surrogate-based techniques are widely used for local and global model-agnostic explanations but have significant limitations. Local surrogates capture non-linearities but are computationally expensive and sensitive to parameters, while global surrogates are more efficient but struggle with complex local behaviors. In this paper, we present ILLUME, a flexible and interpretable framework grounded in representation learning, that can be integrated with various surrogate models to provide explanations for any black-box classifier. Specifically, our approach combines a globally trained surrogate with instance-specific linear transformations learned with a meta-encoder to generate both local and global explanations. Through extensive empirical evaluations, we demonstrate the effectiveness of ILLUME in producing feature attributions and decision rules that are not only accurate but also robust and faithful to the black-box, thus providing a unified explanation framework that effectively addresses the limitations of traditional surrogate methods.

[360] Regularized least squares learning with heavy-tailed noise is minimax optimal

Mattes Mollenhauer, Nicole Mücke, Dimitri Meunier, Arthur Gretton

Main category: cs.LG

TL;DR: Ridge regression in RKHS with heavy-tailed noise achieves optimal rates previously only possible under subexponential noise, using Fuk-Nagaev inequalities.

DetailsMotivation: To establish that regularized least squares can achieve optimal convergence rates even with heavy-tailed noise (finite higher moments), challenging the prevalent subexponential noise assumption.

Method: Uses integral operator framework and derives excess risk bounds via Fuk-Nagaev inequality for Hilbert-space valued random variables.

Result: Obtains excess risk bounds with dominant subgaussian component, achieving optimal convergence rates under standard eigenvalue decay conditions.

Conclusion: Demonstrates asymptotic robustness of ridge regression against heavy-tailed noise, extending optimal rate guarantees beyond subexponential noise assumptions.

Abstract: This paper examines the performance of ridge regression in reproducing kernel Hilbert spaces in the presence of noise that exhibits a finite number of higher moments. We establish excess risk bounds consisting of subgaussian and polynomial terms based on the well-known integral operator framework. The dominant subgaussian component allows us to achieve convergence rates that have previously only been derived under subexponential noise - a prevalent assumption in related work from the last two decades. These rates are optimal under standard eigenvalue decay conditions, demonstrating the asymptotic robustness of regularized least squares against heavy-tailed noise. Our derivations are based on a Fuk-Nagaev inequality for Hilbert-space valued random variables.
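
For a concrete picture of the estimator being analyzed, here is a minimal NumPy sketch of kernel ridge regression, i.e., regularized least squares in an RKHS. The Gaussian kernel, its bandwidth, and the function names are illustrative assumptions, not choices taken from the paper:

```python
import numpy as np

def kernel_ridge_fit(X, y, lam, gamma=1.0):
    """Regularized least squares in an RKHS: solves for the coefficients
    alpha = (K + n*lam*I)^{-1} y of the ridge estimator. The Gaussian
    (RBF) kernel and bandwidth gamma are illustrative, not from the paper."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)  # Gram matrix K_ij = k(x_i, x_j)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

    def predict(Xq):
        # f(x) = sum_i alpha_i * k(x, x_i)
        sq_q = np.sum((Xq[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * sq_q) @ alpha

    return predict
```

With a vanishing regularizer the estimator interpolates the training data; the paper's question is how fast the excess risk of this estimator decays when the noise on `y` has only finitely many moments.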

[361] Causal Graph Neural Networks for Healthcare

Munib Mesinovic, Max Buhlan, Tingting Zhu

Main category: cs.LG

TL;DR: Causal graph neural networks address healthcare AI failures by learning causal mechanisms instead of spurious correlations, enabling more robust and fair clinical applications across various domains.

DetailsMotivation: Healthcare AI systems fail when deployed across institutions due to learning statistical associations rather than causal mechanisms, leading to performance drops and perpetuation of discriminatory patterns.

Method: Combines graph-based representations of biomedical data with causal inference principles, including structural causal models, disentangled causal representation learning, interventional prediction, and counterfactual reasoning on graphs.

Result: Demonstrates clinical value in psychiatric diagnosis, cancer subtyping, physiological monitoring, and drug recommendation, establishing foundations for patient-specific Causal Digital Twins.

Conclusion: While promising, substantial barriers remain including computational requirements, validation challenges, and risks of causal-washing; tiered frameworks are proposed to distinguish causally-inspired architectures from causally-validated discoveries.

Abstract: Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.

[362] Exact Expressive Power of Transformers with Padding

William Merrill, Ashish Sabharwal

Main category: cs.LG

TL;DR: Padded transformers with polynomial padding recognize FO-uniform TC^0, and when combined with O(log^d n) looping, they recognize exactly FO-uniform TC^d, providing parallelizable alternatives to chain-of-thought.

DetailsMotivation: To find more efficient alternatives to chain-of-thought for expanding transformer expressive power without adding parameters, using padding tokens as parallelizable test-time compute.

Method: Use transformers with padding tokens and analyze their computational power with polynomial padding and dynamic depth increase via looping.

Result: Padded transformers with polynomial padding recognize FO-uniform TC^0, and with O(log^d n) looping they recognize exactly FO-uniform TC^d.

Conclusion: Padding and looping together systematically expand transformers’ expressive power, motivating further exploration as parallelizable alternatives to chain-of-thought.

Abstract: Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer’s expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{FO}$-uniform $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers’ expressive power: with polylogarithmic looping, polynomially padded transformers recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought for test-time compute.

[363] On scalable and efficient training of diffusion samplers

Minkyu Kim, Kiyoung Seong, Dongyeop Woo, Sungsoo Ahn, Minsu Kim

Main category: cs.LG

TL;DR: Proposes a scalable framework combining MCMC samplers with diffusion models to efficiently sample from unnormalized energy distributions without data, addressing mode collapse and improving sample efficiency.

DetailsMotivation: Existing diffusion samplers struggle with scalability in high-dimensional spaces and expensive energy evaluations, limiting their practical application.

Method: Uses MCMC samplers with novelty-based auxiliary energy as “Searcher” to collect off-policy samples, combined with on-policy data to train diffusion models, plus periodic re-initialization to prevent mode collapse.

Result: Significantly improves sample efficiency on standard benchmarks, excels at higher-dimensional problems and real-world molecular conformer generation.

Conclusion: The proposed framework effectively harmonizes classical sampling methods with diffusion samplers, enabling scalable and efficient sampling from complex energy distributions.

Abstract: We address the challenge of training diffusion models to sample from unnormalized energy distributions in the absence of data, the so-called diffusion samplers. Although these approaches have shown promise, they struggle to scale in more demanding scenarios where energy evaluations are expensive and the sampling space is high-dimensional. To address this limitation, we propose a scalable and sample-efficient framework that properly harmonizes the powerful classical sampling method and the diffusion sampler. Specifically, we utilize Markov chain Monte Carlo (MCMC) samplers with a novelty-based auxiliary energy as a Searcher to collect off-policy samples, where the auxiliary energy compensates for modes the diffusion sampler rarely visits. These off-policy samples are then combined with on-policy data to train the diffusion sampler, thereby expanding its coverage of the energy landscape. Furthermore, we identify primacy bias, i.e., the preference of samplers for early experience during training, as the main cause of mode collapse, and introduce a periodic re-initialization trick to resolve this issue. Our method significantly improves sample efficiency on standard benchmarks for diffusion samplers and also excels at higher-dimensional problems and real-world molecular conformer generation.

[364] FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong

Main category: cs.LG

TL;DR: FATE is a new formal algebra benchmark series that goes beyond contest math to evaluate advanced mathematical reasoning, revealing major performance gaps in current LLMs.

DetailsMotivation: To bridge the gap between contest-based mathematical benchmarks and the depth/breadth of modern mathematical research, as current benchmarks don't reflect research-level abstraction.

Method: Introduced FATE-H and FATE-X benchmarks with 100 problems each in abstract and commutative algebra, spanning from undergraduate to PhD+ difficulty, with two-stage evaluation of LLMs’ natural-language reasoning vs. formalization ability.

Result: Best LLM achieved only 3% accuracy on FATE-H and 0% on FATE-X, revealing models’ natural-language reasoning is more accurate than their formalization ability, with systematic error classification.

Conclusion: FATE provides essential checkpoints for research-level formal mathematical reasoning, showing current models are far from capable of advanced mathematical research.

Abstract: Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models’ natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.

[365] Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference

Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari

Main category: cs.LG

TL;DR: Unstructured sparsity enables 70% KV cache compression for LLMs without accuracy loss, using per-token magnitude-based pruning and custom sparse attention kernels to achieve 2.23x throughput improvement.

DetailsMotivation: KV cache size is a major bottleneck in LLM decode performance due to high memory overhead for large context lengths, limiting throughput and context length capabilities.

Method: Systematic exploration of pruning strategies using per-token magnitude-based pruning for Key and Value caches, combined with bitmap-based sparse format and custom attention kernel for direct computation over compressed caches.

Result: Achieves up to 70% sparsity without accuracy compromise, reduces KV cache size to 45% of dense inference, enables 2.23x tokens/sec throughput improvement and longer context lengths.

Conclusion: Unstructured sparsity with magnitude-based pruning and specialized sparse attention kernels effectively addresses KV cache memory bottlenecks, significantly improving LLM decode performance without requiring fine-tuning.

Abstract: We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from a simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format compresses the KV cache to as little as 45% of its dense size, thereby enabling longer context lengths and up to 2.23x higher tokens/sec throughput compared to dense inference. Our pruning mechanism and sparse attention kernel are available at https://github.com/dhjoo98/mustafar.
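
The per-token magnitude-based pruning step can be sketched as follows. This is an illustrative NumPy version with a hypothetical function name, not the authors' bitmap format or CUDA kernel: each token's cache vector keeps only its largest-magnitude entries and zeros the rest.

```python
import numpy as np

def prune_per_token(cache: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries of each token's vector.

    cache: (num_tokens, head_dim) slice of a Key or Value cache.
    sparsity: fraction of entries to drop per token (e.g. 0.7).
    """
    num_tokens, head_dim = cache.shape
    k = max(1, int(round(head_dim * (1.0 - sparsity))))  # entries kept per token
    pruned = np.zeros_like(cache)
    for t in range(num_tokens):
        keep = np.argsort(np.abs(cache[t]))[-k:]  # indices of the k largest magnitudes
        pruned[t, keep] = cache[t, keep]
    return pruned
```

In the paper's system the zeros are never stored: a bitmap marks which positions survive, and the attention kernel computes directly over the compressed representation.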

[366] How do Transformers Learn Implicit Reasoning?

Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Weichuan Liu, Xiaoyin Che, Lei Hou, Juanzi Li

Main category: cs.LG

TL;DR: Transformers develop implicit multi-hop reasoning through three stages: memorization, in-distribution generalization, and cross-distribution generalization. Training with atomic triples accelerates learning, and second-hop generalization requires exposure to specific compositional structures.

DetailsMotivation: To understand how large language models perform implicit multi-hop reasoning without explicitly verbalizing intermediate steps, and to uncover the underlying mechanisms of this emergent capability.

Method: Training transformers from scratch in a controlled symbolic environment, using diagnostic tools including cross-query semantic patching and cosine-based representational analysis to examine intermediate representations and hidden space clustering.

Result: Revealed a three-stage developmental trajectory and found that successful reasoning correlates with cosine-based clustering in hidden space. Training with atomic triples accelerates learning but isn’t necessary, and second-hop generalization depends on query-level exposure to compositional structures.

Conclusion: The findings provide insights into interpretability of implicit multi-hop reasoning, linking representational structure to reasoning capability and offering pathways to enhance model transparency.

Abstract: Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly – producing correct answers without explicitly verbalizing intermediate steps – but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with cosine-based clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.
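
A simple stand-in for a cosine-based clustering measure over hidden states might look like the sketch below (the paper's actual lens is not specified here; this is only the generic quantity such an analysis would track, with a hypothetical function name):

```python
import numpy as np

def mean_pairwise_cosine(H: np.ndarray) -> float:
    """Average pairwise cosine similarity within a set of hidden states
    H of shape (num_states, hidden_dim); higher means tighter clustering."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize rows
    S = Hn @ Hn.T                                      # all pairwise cosines
    n = H.shape[0]
    return (S.sum() - n) / (n * (n - 1))               # exclude the diagonal
```

Identical states give a score of 1.0 and mutually orthogonal states give 0.0, so the score rising during training would indicate the kind of hidden-space clustering the paper links to successful reasoning.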

[367] Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

William Merrill, Shane Arora, Dirk Groeneveld, Hannaneh Hajishirzi

Main category: cs.LG

TL;DR: The paper introduces an empirical method to directly measure critical batch size (CBS) in language model training, showing CBS evolves from near zero at initialization to plateauing later. This enables batch size warmup strategies that achieve better training efficiency.

DetailsMotivation: Existing methods for estimating critical batch size rely on strong assumptions about gradient noise scale, limiting practical trust and applicability. A direct, empirical approach is needed to reliably determine optimal batch sizes.

Method: Developed a simple empirical approach to directly measure critical batch size and track its evolution during training across different model sizes (1B and 7B parameters).

Result: CBS starts near zero at initialization, increases rapidly initially, then plateaus. This pattern holds across model sizes. Batch size warmup strategy achieved 43% fewer gradient steps while slightly improving loss compared to original training.

Conclusion: Batch size warmup (starting small and increasing as CBS grows) is an effective strategy for reliable large-batch training, enabling increased data parallelism without performance compromise.

Abstract: The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a critical batch size (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to directly measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases rapidly at first, and then plateaus as training progresses. Furthermore, we find that this trend holds across different model sizes (1B and 7B), suggesting CBS from small training runs can inform larger-scale training runs. Our findings about how the CBS changes over training motivate batch size warmup as a natural way to reliably train language models at large batch size: start the batch size small and increase it as the CBS grows. To validate this claim, we use batch size warmup to train OLMo 1B to slightly better loss than the original training run with 43% fewer gradient steps. This shows how our framework can be applied to reliably train language models at larger batch sizes, increasing data parallelism without compromising performance.
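
The batch size warmup idea (start small, grow as the CBS grows) can be sketched as a schedule like the following. This is a hypothetical geometric schedule for illustration; the paper ties the schedule to its empirically measured CBS curve, which this sketch does not reproduce:

```python
import math

def warmup_batch_size(step: int, base_bs: int = 32, max_bs: int = 2048,
                      warmup_steps: int = 10_000) -> int:
    """Hypothetical batch-size warmup: double the batch size at evenly
    spaced points so it grows from base_bs to max_bs over warmup_steps,
    mirroring the idea of tracking the rising critical batch size."""
    if step >= warmup_steps:
        return max_bs
    total_doublings = int(math.log2(max_bs / base_bs))  # 32 -> 2048 is 6 doublings
    done = int(total_doublings * step / warmup_steps)   # doublings applied so far
    return base_bs * (2 ** done)
```

A run would consult this schedule each optimizer step and re-shard data parallelism whenever the batch size increases, keeping the batch at or below the current CBS so loss is not degraded.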

[368] Learning-at-Criticality in Large Language Models for Quantum Field Theory and Beyond

Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, Kun Chen

Main category: cs.LG

TL;DR: Learning at Criticality (LaC) is a reinforcement learning method that tunes Large Language Models to operate at a sharp learning transition point, enabling peak generalization from minimal data for complex symbolic problems in fundamental physics.

DetailsMotivation: Address the challenge of applying AI to fundamental physics problems where data is scarce and few guiding principles exist, by leveraging critical phenomena to enhance learning efficiency.

Method: LaC uses reinforcement learning to tune LLMs to a critical learning transition point. A minimal concept-network model (CoNet) is analyzed to understand the underlying mechanism, showing characteristics of second-order phase transitions with power-law distributed solution paths.

Result: The method successfully solves 7-digit base-7 addition problems and symbolic Matsubara sums in quantum field theory. An 8B-parameter LLM tuned by LaC outperforms much larger models on unseen, higher-order problems.

Conclusion: LLMs achieve peak performance by operating at criticality, where scale-free exploration enables extraction of underlying operational rules. LaC leverages physical critical phenomena to empower AI for data-sparse challenges in fundamental physics.

Abstract: Fundamental physics often confronts complex symbolic problems with few guiding exemplars or established principles. While artificial intelligence (AI) offers promise, its typical need for vast datasets to learn from hinders its use in these information-scarce frontiers. We introduce learning at criticality (LaC), a reinforcement learning (RL) scheme that tunes Large Language Models (LLMs) to a sharp learning transition, addressing this information scarcity. At this transition, LLMs achieve peak generalization from minimal data, exemplified by 7-digit base-7 addition – a test of nontrivial arithmetic reasoning. To elucidate this peak, we analyze a minimal concept-network model (CoNet) designed to capture the essence of how LLMs might link tokens. Trained on a single exemplar, this model also undergoes a sharp learning transition. This transition exhibits hallmarks of a second-order phase transition, notably power-law distributed solution path lengths. At this critical point, the system maximizes a "critical thinking pattern" crucial for generalization, enabled by the underlying scale-free exploration. This suggests LLMs reach peak performance by operating at criticality, where such explorative dynamics enable the extraction of underlying operational rules. We demonstrate LaC in quantum field theory: an 8B-parameter LLM, tuned to its critical point by LaC using a few exemplars of symbolic Matsubara sums, solves unseen, higher-order problems, significantly outperforming far larger models. LaC thus leverages critical phenomena, a physical principle, to empower AI for complex, data-sparse challenges in fundamental physics.

[369] Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning

Yuan Yuan, Yukun Liu, Chonghua Han, Jie Feng, Yong Li

Main category: cs.LG

TL;DR: MoveGCL is a privacy-preserving framework for training mobility foundation models using generative continual learning without sharing raw data, achieving performance comparable to joint training while protecting privacy.

DetailsMotivation: Foundation models have transformed other fields but building them for human mobility is challenging due to privacy concerns and data silos across institutions.

Method: Uses generative continual learning with synthetic trajectory replay from frozen teacher models, knowledge distillation to prevent forgetting, Mixture-of-Experts Transformer with mobility-aware routing, and layer-wise progressive adaptation.

Result: Experiments on six real-world datasets show MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines while providing strong privacy protection.

Conclusion: MoveGCL represents a crucial step toward foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development.

Abstract: Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models. To facilitate reproducibility and future research, we have released the code and models at https://github.com/tsinghua-fib-lab/MoveGCL.

[370] Communication Efficient LLM Pre-training with SparseLoCo

Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky

Main category: cs.LG

TL;DR: SparseLoCo is a communication-efficient training algorithm for LLMs that combines error feedback with Top-k sparsification and 2-bit quantization to achieve extreme sparsity (1-3%) while outperforming full-precision baselines.

DetailsMotivation: Distributed training of LLMs faces communication bottlenecks even with reduced frequency methods, as they still require full gradient copies. Existing quantization approaches have limited effectiveness and cannot leverage sparsification effectively.

Method: Uses error feedback with Top-k sparsification and 2-bit quantization, approximating outer momentum locally through an error feedback accumulator combined with aggressive sparsity.

Result: Achieves extreme sparsity as low as 1-3% while outperforming full-precision DiLoCo, with sparse aggregation actually improving model performance in communication-constrained settings.

Conclusion: SparseLoCo provides significant benefits in both performance and communication cost for LLM training in bandwidth-constrained environments.

Abstract: Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across datacenters and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model’s gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization is often applied to reduce the pseudo-gradient’s size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have achieved only limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages error feedback with Top-k sparsification and 2-bit quantization to reach extreme sparsity as low as 1-3% while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by an error feedback accumulator combined with aggressive sparsity, and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
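
The core Top-k-with-error-feedback step can be sketched as below. This is a minimal single-tensor NumPy illustration with a hypothetical function name; it omits the 2-bit quantization, the DiLoCo outer loop, and the cross-replica aggregation:

```python
import numpy as np

def topk_with_error_feedback(grad, residual, k):
    """One communication step: add the local residual to the fresh
    pseudo-gradient, transmit only the k largest-magnitude entries,
    and keep the untransmitted remainder as the next residual."""
    acc = grad + residual                 # error-feedback accumulator
    idx = np.argsort(np.abs(acc))[-k:]   # Top-k entries by magnitude
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]               # what gets communicated
    new_residual = acc - sparse          # fed back into the next round
    return sparse, new_residual
```

Because the residual re-enters the accumulator every round, no gradient mass is permanently discarded; it is only delayed, which is what makes such aggressive (1-3%) sparsity tolerable and lets the accumulator double as a local approximation of outer momentum.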

[371] Finetuning LLMs for Human Behavior Prediction in Social Science Experiments

Akaash Kolluri, Shengguang Wu, Joon Sung Park, Michael S. Bernstein

Main category: cs.LG

TL;DR: Finetuning LLMs on social science experiment data (SocSci210 dataset) significantly improves simulation accuracy, achieving 26% better alignment with human responses in unseen studies and 71% improvement in generalizing to new conditions.

Motivation: To leverage LLMs for more accurate social science experiment simulations by finetuning them directly on individual-level responses from past experiments, enabling better experimental hypothesis screening.

Method: Constructed SocSci210 dataset with 2.9M responses from 400K participants across 210 experiments, then finetuned LLMs (Socrates-Qwen-14B) on this data to improve simulation accuracy.

Result: Socrates-Qwen-14B achieved 26% better alignment with human response distributions than base model, outperformed GPT-4o by 13%, showed 71% improvement in generalizing to new conditions, and reduced demographic bias by 10.6%.

Conclusion: Finetuning LLMs on rich social science datasets enables more accurate experimental simulations, suggesting this approach could enhance hypothesis screening in social sciences.

Abstract: Large language models (LLMs) offer a powerful opportunity to simulate the results of social science experiments. In this work, we demonstrate that finetuning LLMs directly on individual-level responses from past experiments meaningfully improves the accuracy of such simulations across diverse social science domains. We construct SocSci210 via an automatic pipeline, a dataset comprising 2.9 million responses from 400,491 participants in 210 open-source social science experiments. Through finetuning, we achieve multiple levels of generalization. In completely unseen studies, our strongest model, Socrates-Qwen-14B, produces predictions that are 26% more aligned with distributions of human responses to diverse outcome questions under varying conditions relative to its base model (Qwen2.5-14B), outperforming GPT-4o by 13%. By finetuning on a subset of conditions in a study, generalization to new unseen conditions is particularly robust, improving by 71%. Since SocSci210 contains rich demographic information, we reduce demographic parity difference, a measure of bias, by 10.6% through finetuning. Because social sciences routinely generate rich, topic-specific datasets, our findings indicate that finetuning on such data could enable more accurate simulations for experimental hypothesis screening. We release our data, models and finetuning code at stanfordhci.github.io/socrates.
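The bias metric reported here, demographic parity difference, has a standard definition that is easy to compute (a sketch of the usual formulation; the paper's exact variant may differ):

```python
import numpy as np

def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between demographic groups."""
    y_pred, groups = np.asarray(y_pred), np.asarray(groups)
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)
```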

[372] CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation

Alyssa Unell, Noel C. F. Codella, Sam Preston, Peniel Argaw, Wen-wai Yim, Zelalem Gero, Cliff Wong, Rajesh Jena, Eric Horvitz, Amanda K. Hall, Ruican Rachel Zhong, Jiachen Li, Shrey Jain, Mu Wei, Matthew Lungren, Hoifung Poon

Main category: cs.LG

TL;DR: LLM agent-based system for automated NCCN guideline-compliant treatment recommendations for NSCLC patients, using hybrid human-AI annotation to reduce costs while maintaining accuracy and regulatory compliance.

Motivation: Manual translation of complex patient data into evidence-based treatment guidelines is time-consuming, requires specialized expertise, and is prone to errors. LLMs offer the potential to automate this process and improve accuracy.

Method: Hybrid approach combining human expert annotations with LLM-generated proxy benchmarks, creating an agent framework that predicts relevant guidelines and a meta-classifier for verifying prediction accuracy with calibrated confidence scores.

Result: Strong correlation with expert annotations (Spearman r=0.88, RMSE=0.08), meta-classifier AUROC=0.800 for treatment recommendation accuracy verification, and successful creation of longitudinal dataset with 121 NSCLC cases.

Conclusion: Establishes clinically viable LLM-based guideline adherence system that balances accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing scalable pathway for automated clinical decision support.

Abstract: The National Comprehensive Cancer Network (NCCN) provides evidence-based guidelines for cancer treatment. Translating complex patient presentations into guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent-based approach to automatically generate guideline-concordant treatment trajectories for patients with non-small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board-certified oncologists. Second, we demonstrate that existing LLMs possess domain-specific knowledge that enables high-quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert-annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta-classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom-tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM-based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.
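The agreement statistics reported against expert annotations (Spearman r = 0.88, RMSE = 0.08) are standard measures; a minimal numpy sketch, with simplified tie handling for illustration:

```python
import numpy as np

def spearman_r(x, y):
    """Spearman rank correlation (no tie correction, for illustration)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def rmse(x, y):
    """Root mean squared error between two aligned score vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.mean((x - y) ** 2)))
```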

[373] SolarCrossFormer: Improving day-ahead Solar Irradiance Forecasting by Integrating Satellite Imagery and Ground Sensors

Baptiste Schubnel, Jelena Simeunović, Corentin Tissier, Pierre-Jean Alet, Rafael E. Carrillo

Main category: cs.LG

TL;DR: SolarCrossFormer is a deep learning model for day-ahead solar irradiance forecasting that combines satellite images and ground-based meteorological data using graph neural networks, achieving 6.1% error with 15-minute resolution forecasts.

Motivation: Current solar irradiance forecasting solutions lack the temporal and spatial resolution required for large-scale integration of solar PV systems into power grids.

Method: Uses novel graph neural networks to exploit inter- and intra-modal correlations between satellite images and ground-based meteorological station time series data.

Result: Achieves normalized mean absolute error of 6.1% over forecasting horizon, competitive with commercial numerical weather prediction services. Can forecast for any location in Switzerland with 15-minute resolution up to 24 hours ahead.

Conclusion: SolarCrossFormer provides robust, high-resolution solar irradiance forecasting that can incorporate new data without retraining and forecast for locations without input data, making it suitable for real-life operations.

Abstract: Accurate day-ahead forecasts of solar irradiance are required for the large-scale integration of solar photovoltaic (PV) systems into the power grid. However, current forecasting solutions lack the temporal and spatial resolution required by system operators. In this paper, we introduce SolarCrossFormer, a novel deep learning model for day-ahead irradiance forecasting that combines satellite images and time series from a ground-based network of meteorological stations. SolarCrossFormer uses novel graph neural networks to exploit the inter- and intra-modal correlations of the input data and improve the accuracy and resolution of the forecasts. It generates probabilistic forecasts for any location in Switzerland with a 15-minute resolution for horizons up to 24 hours ahead. One of the key advantages of SolarCrossFormer is its robustness in real-life operations. It can incorporate new time-series data without retraining the model and, additionally, it can produce forecasts for locations without input data by using only their coordinates. Experimental results over a dataset of one year and 127 locations across Switzerland show that SolarCrossFormer yields a normalized mean absolute error of 6.1% over the forecasting horizon. The results are competitive with those achieved by a commercial numerical weather prediction service.
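The headline metric, normalized mean absolute error, can be computed as follows (a sketch; the normalizer is not stated in the abstract, so normalization by the mean observation is an assumption):

```python
import numpy as np

def nmae(forecast, observed, norm=None):
    """Normalized mean absolute error in percent. The normalizer defaults
    to the mean observed value; capacity- or range-normalization are
    common alternatives in irradiance forecasting."""
    forecast = np.asarray(forecast, float)
    observed = np.asarray(observed, float)
    if norm is None:
        norm = observed.mean()
    return 100.0 * np.mean(np.abs(forecast - observed)) / norm
```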

[374] Large language models surpass domain-specific architectures for antepartum electronic fetal monitoring analysis

Sheng Wong, Ravi Shankar, Beth Albert, Gabriel Davis Jones

Main category: cs.LG

TL;DR: Fine-tuned LLMs outperform domain-specific and foundation models in CTG classification but require more computational resources, with domain-specific models showing better robustness when uterine-activity signals are missing.

Motivation: To explore the potential of foundation models and LLMs in electronic fetal monitoring (EFM) and cardiotocography (CTG) analysis, as most existing studies rely on domain-specific models without systematic comparisons with modern foundation or language models.

Method: Comprehensive benchmark of over 15 models (domain-specific, time-series, foundation, and language models) using over 2,500 20-minute recordings under a unified framework for automated antepartum CTG classification.

Result: Fine-tuned LLMs consistently outperformed both foundation and domain-specific models across data-availability scenarios, except when uterine-activity signals were absent where domain-specific models showed greater robustness. However, LLMs required substantially higher computational resources.

Conclusion: While fine-tuned LLMs achieved state-of-the-art performance for CTG classification, practical deployment must balance performance with computational efficiency.

Abstract: Foundation models (FMs) and large language models (LLMs) have demonstrated promising generalization across diverse domains for time-series analysis, yet their potential for electronic fetal monitoring (EFM) and cardiotocography (CTG) analysis remains underexplored. Most existing CTG studies have relied on domain-specific models and lack systematic comparisons with modern foundation or language models, limiting our understanding of whether these models can outperform specialized systems in fetal health assessment. In this study, we present the first comprehensive benchmark of state-of-the-art architectures for automated antepartum CTG classification. Over 2,500 20-minute recordings were used to evaluate over 15 models spanning domain-specific, time-series, foundation, and language-model categories under a unified framework. Fine-tuned LLMs consistently outperformed both foundation and domain-specific models across data-availability scenarios, except when uterine-activity signals were absent, where domain-specific models showed greater robustness. These performance gains, however, required substantially higher computational resources. Our results highlight that while fine-tuned LLMs achieved state-of-the-art performance for CTG classification, practical deployment must balance performance with computational efficiency.

[375] Empirical Bayesian Multi-Bandit Learning

Xia Jiang, Rong J. B. Zhu

Main category: cs.LG

TL;DR: Proposes hierarchical Bayesian framework for multi-task contextual bandits with empirical covariance estimation, developing ebmTS and ebmUCB algorithms that outperform existing methods.

Motivation: To enhance decision-making across multiple related bandit tasks by leveraging shared structures while accommodating task-specific heterogeneity, addressing limitations of previous methods that overlook covariance structure learning.

Method: Hierarchical Bayesian framework with empirical Bayesian approach to estimate covariance matrix, developing ebmTS (Thompson Sampling) and ebmUCB (Upper Confidence Bound) algorithms that incorporate estimated prior.

Result: Algorithms achieve lower cumulative regret than existing methods on synthetic and real-world datasets, particularly in complex environments, with provided frequentist regret upper bounds.

Conclusion: The proposed hierarchical Bayesian framework with empirical covariance estimation effectively balances exploration and exploitation across multi-bandits, demonstrating superior performance and filling research gap in multi-bandit problems.

Abstract: Multi-task learning in contextual bandits has attracted significant research interest due to its potential to enhance decision-making across multiple related tasks by leveraging shared structures and task-specific heterogeneity. In this article, we propose a novel hierarchical Bayesian framework for learning in various bandit instances. This framework captures both the heterogeneity and the correlations among different bandit instances through a hierarchical Bayesian model, enabling effective information sharing while accommodating instance-specific variations. Unlike previous methods that overlook the learning of the covariance structure across bandits, we introduce an empirical Bayesian approach to estimate the covariance matrix of the prior distribution. This enhances both the practicality and flexibility of learning across multi-bandits. Building on this approach, we develop two efficient algorithms: ebmTS (Empirical Bayesian Multi-Bandit Thompson Sampling) and ebmUCB (Empirical Bayesian Multi-Bandit Upper Confidence Bound), both of which incorporate the estimated prior into the decision-making process. We provide the frequentist regret upper bounds for the proposed algorithms, thereby filling a research gap in the field of multi-bandit problems. Extensive experiments on both synthetic and real-world datasets demonstrate the superior performance of our algorithms, particularly in complex environments. Our methods achieve lower cumulative regret compared to existing techniques, highlighting their effectiveness in balancing exploration and exploitation across multi-bandits.
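The ebmTS idea, estimate the prior's covariance empirically across bandit instances and then run Thompson sampling under that prior, can be sketched as follows (an illustrative reconstruction; the function names and the simple moment estimator are assumptions, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_prior(theta_hats):
    """Empirical-Bayes estimate of a shared Gaussian prior from per-bandit
    parameter estimates (the covariance-learning step the paper highlights).
    theta_hats: (n_bandits, d) array."""
    mu = theta_hats.mean(axis=0)
    centered = theta_hats - mu
    sigma = centered.T @ centered / max(1, len(theta_hats) - 1)
    sigma += 1e-6 * np.eye(theta_hats.shape[1])  # keep positive definite
    return mu, sigma

def ebm_ts_choose(context_arms, mu, sigma):
    """Thompson-sampling step: draw a parameter from the estimated prior
    and play the arm whose context scores highest under the draw."""
    theta = rng.multivariate_normal(mu, sigma)
    return int(np.argmax(context_arms @ theta))
```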

[376] LLMs as In-Context Meta-Learners for Model and Hyperparameter Selection

Youssef Attia El Hili, Albert Thomas, Malik Tiomoko, Abdelhakim Benechehab, Corentin Léger, Corinne Ancourt, Balázs Kégl

Main category: cs.LG

TL;DR: LLMs can serve as in-context meta-learners for model and hyperparameter selection by using dataset metadata, achieving competitive performance without expensive search.

Motivation: Model and hyperparameter selection typically requires expert intuition or expensive automated search, creating a need for more accessible and efficient approaches.

Method: Convert datasets into interpretable metadata and prompt LLMs in two modes: zero-shot (using pretrained knowledge) and meta-informed (augmented with examples of models and their past performance).

Result: LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, with meta-informed prompting showing improved performance through in-context meta-learning.

Conclusion: LLMs show promise as lightweight, general-purpose assistants for model selection and hyperparameter optimization, demonstrating capacity for in-context meta-learning.

Abstract: Model and hyperparameter selection are critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. We investigate whether large language models (LLMs) can act as in-context meta-learners for this task. By converting each dataset into interpretable metadata, we prompt an LLM to recommend both model families and hyperparameters. We study two prompting strategies: (1) a zero-shot mode relying solely on pretrained knowledge, and (2) a meta-informed mode augmented with examples of models and their performance on past tasks. Across synthetic and real-world benchmarks, we show that LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, and that improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning. These results highlight a promising new role for LLMs as lightweight, general-purpose assistants for model selection and hyperparameter optimization.
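The two prompting modes can be illustrated with toy templates (the field names and wording below are invented for illustration; they are not the paper's actual prompts):

```python
def zero_shot_prompt(meta):
    """Zero-shot mode: only the target dataset's metadata."""
    return (
        "Dataset: {name}, {n_samples} samples, {n_features} features, "
        "task: {task}.\nRecommend a model family and hyperparameters."
    ).format(**meta)

def meta_informed_prompt(meta, history):
    """Meta-informed mode: prepend (dataset, model, score) examples
    from past tasks before the same zero-shot request."""
    examples = "\n".join(
        f"- {h['dataset']}: {h['model']} -> accuracy {h['score']:.2f}"
        for h in history
    )
    return f"Past results:\n{examples}\n\n{zero_shot_prompt(meta)}"
```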

[377] Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali

Main category: cs.LG

TL;DR: Flashlight is a compiler-native framework that automatically generates efficient FlashAttention-style kernels for arbitrary attention-based programs in PyTorch, supporting more general attention variants than previous approaches.

Motivation: Existing attention optimization approaches like FlashAttention and FlexAttention have limitations: FlashAttention is specialized, while FlexAttention uses static templates and only supports a subset of attention variants. There's a need for a more flexible solution that can handle arbitrary attention patterns without sacrificing performance.

Method: Flashlight leverages PyTorch’s compilation workflow to automatically fuse and tile attention computations transparently. It generates fused kernels for arbitrary attention-based programs without relying on static templates or predefined kernel specializations.

Result: Flashlight produces kernels with competitive or superior performance to FlexAttention, while supporting all FlexAttention-expressible variants plus more general, data-dependent attention formulations that FlexAttention cannot handle.

Conclusion: Flashlight enables developers to rapidly explore new attention models with native PyTorch code flexibility while maintaining high performance, bridging the gap between flexibility and efficiency in attention implementations.

Abstract: Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch’s compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
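The kernel structure that FlashAttention-style fusion targets, tiling over keys/values with an online softmax so the full score matrix is never materialized, can be illustrated in plain numpy for a single query (a conceptual sketch, not Flashlight's generated code):

```python
import numpy as np

def tiled_attention(q, K, V, block=64):
    """Attention for a single query computed one K/V tile at a time with
    the online-softmax trick, so all scores are never held at once.
    This is the structure that fused FlashAttention-style kernels
    implement on-chip."""
    d = q.shape[-1]
    m = -np.inf                      # running max of scores seen so far
    l = 0.0                          # running softmax normalizer
    acc = np.zeros(V.shape[-1])      # running weighted sum of values
    for start in range(0, len(K), block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)   # scores for this tile
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)    # rescale old state to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l
```

The test of correctness is that the tiled result matches a naive softmax over all keys at once; fused kernels obtain their speedup from never writing the intermediate scores to memory.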

cs.MA

[378] OptiMA: A Transaction-Based Framework with Throughput Optimization for Very Complex Multi-Agent Systems

Umut Çalıkyılmaz, Nitin Nayak, Jinghua Groppe, Sven Groppe

Main category: cs.MA

TL;DR: Proposes OptiMA framework with transaction-based design and scheduling to handle complexity in very complex multi-agent systems (VCMAS), demonstrating improved performance and scalability.

Motivation: Address pitfalls of increasing complexity in multi-agent systems: susceptibility to faults and performance bottlenecks.

Method: Transaction-based framework design with integrated transaction scheduling, implemented as OptiMA framework.

Result: Successfully executed VCMAS with over 100 agents, achieved performance improvements up to 16% through transaction scheduling.

Conclusion: OptiMA framework effectively manages complexity in large multi-agent systems while providing theoretical analysis and practical tools for future transaction scheduling research.

Abstract: In recent years, research on multi-agent systems has moved toward exploring larger and more complex models to fulfill sophisticated tasks. We point out two possible pitfalls that might be caused by increasing complexity: susceptibility to faults, and performance bottlenecks. To prevent the former threat, we propose a transaction-based framework to design very complex multi-agent systems (VCMAS). To address the second threat, we propose integrating transaction scheduling into the proposed framework. We implemented both of these ideas to develop the OptiMA framework and show that it is able to facilitate the execution of VCMAS with more than a hundred agents. We also demonstrate the effect of transaction scheduling on such a system by showing improvements of more than 16%. Furthermore, we also performed a theoretical analysis of the transaction scheduling problem and provided practical tools that can be used for future research on it.

[379] ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

Yuran Ding, Xinwei Chen, Xiaofan Zhang, Zongwei Zhou

Main category: cs.MA

TL;DR: ASAP is an AI agent system that automates performance optimization for distributed LLM training, achieving up to 28% speedup and 1.43x throughput improvement through automated bottleneck diagnosis and sharding configuration.

Motivation: Existing LLM training optimization methods rely on manual tuning or black-box searches that are slow and cannot keep up with the rapidly evolving LLM domain, leading to underutilized resources and slow development.

Method: Multi-agent system with Coordinator, Analyzer, and Proposal agents that integrates LLM reasoning with performance profiling, roofline analysis, and knowledge base of best practices to automate diagnosis and optimization recommendations.

Result: ASAP-generated configurations achieved up to 28% training step time reduction and 1.43x throughput improvement, which can be further increased to 2.58x when combined with human expert optimization.

Conclusion: ASAP provides a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training, automating optimization processes that were previously manual and time-consuming.

Abstract: Optimizing large-language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges due to its complex optimization space. Existing optimization methods, however, rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce ASAP, an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. It is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, which integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design can automate the diagnosis of performance bottlenecks and recommend optimized sharding configurations with reasoning, thus effectively improving the efficiency of distributed LLM training. Experiments have shown that the ASAP-generated sharding configurations can contribute up to 28% training step time reduction and 1.43 times throughput improvement. When combined with additional optimization from human experts, throughput can be further increased to 2.58 times. The proposed ASAP promises to provide a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training.
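The roofline analysis that the Analyzer agent draws on reduces to a simple bound: attainable throughput is the lower of the compute roof and the memory roof. A minimal sketch (this is the standard roofline formula, not ASAP's implementation):

```python
def roofline_attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound: a kernel's attainable throughput is capped by the
    lower of the compute roof (peak FLOP/s) and the memory roof
    (bandwidth in bytes/s times arithmetic intensity in FLOPs/byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)
```

A kernel whose arithmetic intensity puts it below the ridge point is memory-bound; diagnosing which side of the ridge a training step sits on is what guides sharding recommendations.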

[380] Multi-Agent Collaborative Framework For Math Problem Generation

Kia Karbasi, Kevin Hong, Mohammad Amin Samadi, Gregory Pottie

Main category: cs.MA

TL;DR: A collaborative multi-agent framework for automatic math question generation that uses iterative refinement to better control problem complexity and cognitive demands.

Motivation: Existing transformer-based models struggle to precisely control problem complexity and cognitive demands in automatic question generation for mathematics education.

Method: A collaborative multi-agent framework that leverages multiple agents iteratively refining generated question-answer pairs to balance complexity and cognitive demand.

Result: Preliminary evaluations show improved quality of educational content with better balance between cognitive challenge and clarity across five meta-evaluation criteria.

Conclusion: Collaborative multi-agent workflows can yield more controlled, pedagogically valuable content for advancing automated educational content generation and adaptive learning environments.

Abstract: Automatic question generation (AQG) for mathematics education remains an elusive goal for Intelligent Tutoring Systems and educators. While pre-trained transformer-based language models have significantly advanced natural language generation, they often struggle to precisely control problem complexity and cognitive demands. In this paper, we introduce a collaborative multi-agent framework as a novel method of incorporating inference-time computation into AQG. This approach leverages multiple agents that iteratively refine generated question-answer pairs to better balance complexity and cognitive demand. We evaluate the generated questions on five meta-evaluation criteria: relevance, importance, clarity, difficulty matching, and answerability, to assess the system’s ability to control the required complexity and quality of the questions. Preliminary evaluations show that this collaborative multi-agent framework elevates the quality of generated educational content by fostering a more nuanced balance between cognitive challenge and clarity. These promising outcomes suggest that integrating collaborative multi-agent workflows can yield more controlled, pedagogically valuable content that can help advance automated educational content generation and adaptive learning environments.

[381] Robust Multi-Agent Decision-Making in Finite-Population Games

Shinkyu Park, Lucas C. D. Bezerra

Main category: cs.MA

TL;DR: Analysis of KLD-RL model robustness in finite-population games, focusing on parameter tuning to mitigate noise and modeling inaccuracies.

Motivation: To understand how model parameters affect agent decision-making under noise and modeling errors commonly encountered in engineering applications of population games.

Method: Theoretical analysis of KLD-RL model parameters’ influence on noise impact, supported by numerical examples and simulation studies.

Result: Provides insights into effective parameter tuning strategies to mitigate the effects of noise and modeling inaccuracies on agent decision-making.

Conclusion: Theoretical findings are validated through simulations, offering practical guidance for parameter selection in KLD-RL models applied to population games.

Abstract: We study the robustness of an agent decision-making model in finite-population games, with a particular focus on the Kullback-Leibler Divergence Regularized Learning (KLD-RL) model. Specifically, we examine how the model’s parameters influence the impact of various sources of noise and modeling inaccuracies – factors commonly encountered in engineering applications of population games – on agents’ decision-making. Our analysis provides insights into how these parameters can be effectively tuned to mitigate such effects. Theoretical results are supported by numerical examples and simulation studies that validate the analysis and illustrate practical strategies for parameter selection.
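KLD-RL is commonly formulated as maximizing expected payoff minus a KL penalty to a reference distribution, which admits a closed-form softmax-like solution. A sketch of that choice rule (an illustrative reconstruction; the paper's finite-population dynamics and noise analysis build on top of this):

```python
import numpy as np

def kld_rl_choice(payoffs, reference, lam):
    """KLD-RL choice rule: the distribution x maximizing
    E_x[payoff] - lam * KL(x || reference), whose closed form is
    x_i proportional to reference_i * exp(payoff_i / lam).
    Larger lam pins choices to the reference, damping noisy payoffs."""
    logits = np.log(reference) + np.asarray(payoffs, float) / lam
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

The regularization weight is exactly the kind of parameter whose tuning the paper analyzes: small values make agents chase (possibly noisy) payoff estimates, large values keep behavior near the reference.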

[382] Learning Communication Skills in Multi-task Multi-agent Deep Reinforcement Learning

Changxi Zhu, Mehdi Dastani, Shihan Wang

Main category: cs.MA

TL;DR: MCS is a multi-agent deep RL method that enables agents to learn and perform multiple tasks simultaneously using learnable communication protocols and Transformer encoders.

Motivation: To improve coordination and knowledge transfer in multi-agent systems when handling multiple tasks, leveraging communication to share information across tasks.

Method: Uses Transformer encoder to encode task-specific observations into shared message space, with prediction network correlating messages with sender actions for better coordination.

Result: Outperforms multi-task MADRL baselines without communication and single-task MADRL baselines with/without communication on adapted multi-agent benchmark environments.

Conclusion: MCS effectively enables multi-task learning in MADRL through learnable communication protocols, demonstrating superior performance over existing approaches.

Abstract: In multi-agent deep reinforcement learning (MADRL), agents can communicate with one another to perform a task in a coordinated manner. When multiple tasks are involved, agents can also leverage knowledge from one task to improve learning in other tasks. In this paper, we propose Multi-task Communication Skills (MCS), a MADRL with communication method that learns and performs multiple tasks simultaneously, with agents interacting through learnable communication protocols. MCS employs a Transformer encoder to encode task-specific observations into a shared message space, capturing shared communication skills among agents. To enhance coordination among agents, we introduce a prediction network that correlates messages with the actions of sender agents in each task. We adapt three multi-agent benchmark environments to multi-task settings, where the number of agents as well as the observation and action spaces vary across tasks. Experimental results demonstrate that MCS achieves better performance than multi-task MADRL baselines without communication, as well as single-task MADRL baselines with and without communication.

cs.MM

[383] On the Brittleness of CLIP Text Encoders

Allie Tran, Luca Rossetto

Main category: cs.MM

TL;DR: Analysis of CLIP model robustness against non-semantic query perturbations in multimedia retrieval, finding syntactic and semantic changes cause largest instabilities.

Motivation: CLIP models trained on contrastive alignment lack stability towards small input perturbations, especially in manually expressed queries where minor variations cause large ranking differences.

Method: Systematic analysis of lexical, syntactic, and semantic query perturbations across multiple CLIP variants using TRECVID Ad-Hoc Video Search queries and V3C1 video collection.

Result: Syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits like punctuation and case changes.

Conclusion: Robustness is a critical dimension for evaluating vision-language models beyond benchmark accuracy, highlighting the need for more stable multimodal co-embedding models.

Abstract: Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models trained with a contrastive alignment objective can lack stability under small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.
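An evaluation harness in the spirit of this study can be sketched without the CLIP encoder itself: generate surface perturbations of a query and score how stable the retrieval ranking is under each (illustrative code; the perturbation set and stability metric here are assumptions, not the paper's exact protocol):

```python
import string

def lexical_perturbations(query):
    """Surface-level edits of the kind the study probes (case, punctuation)."""
    return {
        "lowercase": query.lower(),
        "uppercase": query.upper(),
        "no_punct": query.translate(str.maketrans("", "", string.punctuation)),
        "trailing_period": query.rstrip(".") + ".",
    }

def topk_overlap(rank_a, rank_b, k=10):
    """Fraction of shared items between two top-k result lists; a crude
    stability score for a query before and after perturbation."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k
```

In a full harness, each perturbed query would be embedded with the text encoder under test and `topk_overlap` computed between the original and perturbed result lists.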

eess.AS

[384] CardioPHON: Quality assessment and self-supervised pretraining for screening of cardiac function based on phonocardiogram recordings

Vladimir Despotovic, Peter Pocta, Andrej Zgank

Main category: eess.AS

TL;DR: CardioPHON is a self-supervised pretrained model for heart sound quality assessment and classification that achieves state-of-the-art performance in detecting abnormal cardiac function from phonocardiogram recordings.

DetailsMotivation: Remote monitoring of cardiovascular diseases enables early detection of abnormal cardiac function for timely intervention and personalized treatment. Computer-assisted systems can automatically detect heart sound abnormalities as first-line screening tools.

Method: The model is pretrained in self-supervised fashion on six heart sound datasets, includes automatic removal of low-quality recordings, and uses multimodal approach combining audio and socio-demographic features. Also includes unimodal version using only phonocardiogram recordings.

Result: Achieved best ranking on 2022 George B. Moody PhysioNet heart sound challenge leaderboard. Multimodal model demonstrated superior performance, while unimodal model ranked first among unimodal approaches (overall rank 4), surpassing models using multiple modalities.

Conclusion: CardioPHON is the first publicly released pretrained model for heart sound recordings, facilitating development of data-efficient AI models that can generalize to various downstream tasks in cardiovascular diagnostics.

Abstract: Remote monitoring of cardiovascular diseases plays an essential role in early detection of abnormal cardiac function, enabling timely intervention, improved preventive care, and personalized patient treatment. Abnormalities in the heart sounds can be detected automatically via computer-assisted decision support systems, and used as the first-line screening tool for detection of cardiovascular problems, or for monitoring the effects of treatments and interventions. We propose in this paper CardioPHON, an integrated heart sound quality assessment and classification tool that can be used for screening of abnormal cardiac function from phonocardiogram recordings. The model is pretrained in a self-supervised fashion on a collection of six small- and mid-sized heart sound datasets, enables automatic removal of low quality recordings to ensure that subtle sounds of heart abnormalities are not misdiagnosed, and provides a state-of-the-art performance for the heart sound classification task. The multimodal model that combines audio and socio-demographic features demonstrated superior performance, achieving the best ranking on the official leaderboard of the 2022 George B. Moody PhysioNet heart sound challenge, whereas the unimodal model, which is based only on phonocardiogram recordings, holds the first position among the unimodal approaches (an overall rank of 4), surpassing models utilizing multiple modalities. CardioPHON is the first publicly released pretrained model in the domain of heart sound recordings, facilitating the development of data-efficient artificial intelligence models that can generalize to various downstream tasks in cardiovascular diagnostics.
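The quality-gating step can be pictured as a simple filter over recordings before classification. The real system uses a learned quality model, so the RMS-threshold proxy and record layout below are purely illustrative assumptions:

```python
def rms(samples):
    # Root-mean-square amplitude of a recording.
    return (sum(v * v for v in samples) / len(samples)) ** 0.5

def quality_gate(recordings, threshold=0.1):
    # Drop recordings whose crude RMS proxy falls below the quality threshold,
    # so subtle heart-sound abnormalities are not "diagnosed" from unusable audio.
    return [r for r in recordings if rms(r["samples"]) >= threshold]

recordings = [
    {"id": "good", "samples": [0.5, -0.5, 0.5, -0.5]},
    {"id": "too_quiet", "samples": [0.01, -0.01, 0.01, -0.01]},
]
kept = quality_gate(recordings)
```

Only the recordings that pass the gate proceed to the downstream (multimodal or audio-only) classifier.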

[385] dCoNNear: An Artifact-Free Neural Network Architecture for Closed-loop Audio Signal Processing

Chuan Wen, Guy Torfs, Sarah Verhulst

Main category: eess.AS

TL;DR: dCoNNear is a novel DNN architecture that eliminates tonal and aliasing artifacts in closed-loop audio systems, improving sound quality for hearing-aid and speech-enhancement applications.

DetailsMotivation: Current DNN-based closed-loop systems suffer from sound quality degradation due to artifacts from suboptimal sampling methods, particularly tonal and aliasing artifacts.

Method: Developed dCoNNear architecture specifically designed to prevent spurious artifacts in closed-loop frameworks, validated using biophysically realistic auditory models and speech-enhancement experiments.

Result: dCoNNear accurately simulates non-DNN biophysical models while eliminating audible artifacts, significantly improving perceptual sound quality in both hearing-aid and speech-enhancement applications.

Conclusion: dCoNNear provides a robust, perceptually transparent closed-loop processing framework for high-fidelity audio applications without architecture-induced artifacts.

Abstract: Recent advances in deep neural networks (DNNs) have significantly improved various audio processing applications, including speech enhancement, synthesis, and hearing-aid algorithms. DNN-based closed-loop systems have gained popularity in these applications due to their robust performance and ability to adapt to diverse conditions. Despite their effectiveness, current DNN-based closed-loop systems often suffer from sound quality degradation caused by artifacts introduced by suboptimal sampling methods. To address this challenge, we introduce dCoNNear, a novel DNN architecture designed for seamless integration into closed-loop frameworks. This architecture specifically aims to prevent the generation of spurious artifacts, most notably tonal and aliasing artifacts arising from non-ideal sampling layers. We demonstrate the effectiveness of dCoNNear through a proof-of-principle example within a closed-loop framework that employs biophysically realistic models of auditory processing for both normal and hearing-impaired profiles to design personalized hearing-aid algorithms. We further validate the broader applicability and artifact-free performance of dCoNNear through speech-enhancement experiments, confirming its ability to improve perceptual sound quality without introducing architecture-induced artifacts. Our results show that dCoNNear not only accurately simulates all processing stages of existing non-DNN biophysical models but also significantly improves sound quality by eliminating audible artifacts in both hearing-aid and speech-enhancement applications. This study offers a robust, perceptually transparent closed-loop processing framework for high-fidelity audio applications.
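The failure mode the paper targets can be reproduced in a few lines: decimating a Nyquist-rate tone with a bare stride-2 "sampling layer" folds it to DC (a spurious tonal component), while even a crude low-pass before decimation suppresses the artifact. This is a generic illustration of aliasing from non-ideal sampling layers, not dCoNNear's actual architecture:

```python
import math

n = 16
# A tone at the Nyquist frequency: +1, -1, +1, -1, ...
x = [math.cos(math.pi * t) for t in range(n)]

# Non-ideal stride-2 sampling: the tone aliases to a constant (DC) signal.
naive = x[::2]
# Crude low-pass (pairwise average) before decimation: high-frequency content
# is removed instead of being folded back into the band.
smoothed = [(x[i] + x[i + 1]) / 2 for i in range(0, n, 2)]
```

`naive` comes out as a constant +1 sequence, an audible artifact with no counterpart in the input; `smoothed` is all zeros.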

[386] Uncertainty Quantification in Melody Estimation using Histogram Representation

Kavya Ranjan Saxena, Vipul Arora

Main category: eess.AS

TL;DR: The paper proposes regression-based methods for melody estimation with uncertainty estimation, addressing limitations of classification-based approaches by better capturing prediction deviation magnitude.

DetailsMotivation: Existing classification-based confidence estimation for melody prediction fails to capture the magnitude of deviation from ground truth, limiting reliability of uncertainty estimates.

Method: Three regression-based methods: two map pitch values to continuous range to handle voicing discontinuity, and one Bayesian method that models voicing detection as classification and pitch estimation as regression.

Result: Regression-based formulations provide more reliable uncertainty estimates than classification-based approaches for identifying incorrect pitch predictions.

Conclusion: The Bayesian method performs best among proposed approaches for both melody estimation and associated uncertainty estimation.

Abstract: Confidence estimation can improve the reliability of melody estimation by indicating which predictions are likely incorrect. The existing classification-based approach provides confidence for predicted pitch classes but fails to capture the magnitude of deviation from the ground truth. To address this limitation, we reformulate melody estimation as a regression problem and propose a novel approach to estimate uncertainty directly from the histogram representation of the pitch values, which correlates well with the deviation between the prediction and the ground truth. We design three methods to model pitch on the continuous support of the histogram, which introduces the challenge of handling the discontinuity between unvoiced and voiced pitch values. The first two methods address the abrupt discontinuity by mapping the pitch values to a continuous range, while the third adopts a fully Bayesian formulation, which models voicing detection as a classification task and voiced pitch estimation as a regression task. Experimental results demonstrate that regression-based formulations yield more reliable uncertainty estimates compared to classification-based approaches in identifying incorrect pitch predictions. When the proposed methods are compared with a state-of-the-art regression model, the Bayesian method performs the best at estimating both the melody and its associated uncertainty.
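A sketch of why a histogram over pitch values carries uncertainty information: its spread (variance or entropy) grows with prediction ambiguity, whereas a class-confidence score does not reflect how far a wrong prediction lies from the truth. The bin grid and values below are toy assumptions:

```python
import math

def histogram_stats(bins, probs):
    # Mean, variance, and entropy of a normalized pitch histogram;
    # variance/entropy serve as uncertainty proxies for the pitch estimate.
    mean = sum(b * p for b, p in zip(bins, probs))
    var = sum(p * (b - mean) ** 2 for b, p in zip(bins, probs))
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return mean, var, entropy

bins = [60.0, 61.0, 62.0]  # pitch bins on a toy MIDI-number grid
confident = histogram_stats(bins, [0.0, 1.0, 0.0])
ambiguous = histogram_stats(bins, [1 / 3, 1 / 3, 1 / 3])
```

Both histograms share the same mean pitch, but the ambiguous one has nonzero variance and maximal entropy, flagging a prediction that is more likely to deviate from the ground truth.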

eess.IV

[387] Reconstruction-free segmentation from undersampled k-space using transformers

Yundi Zhang, Nil Stolt-Ansó, Jiazhen Pan, Wenqi Huang, Kerstin Hammernik, Daniel Rueckert

Main category: eess.IV

TL;DR: Direct cardiac segmentation from undersampled k-space measurements using transformer architecture, bypassing intermediate image reconstruction for better performance at high acceleration factors.

DetailsMotivation: High acceleration factors limit MRI image reconstruction, which subsequently constrains segmentation models when treated as independent processes.

Method: Transformer architecture encodes global k-space information into latent features, which condition queried coordinates during decoding to generate segmentation class probabilities.

Result: Produces better segmentations across high acceleration factors than image-based segmentation baselines.

Conclusion: Direct segmentation from k-space circumvents intermediate reconstruction, enabling assessment of myocardial structure and function at higher acceleration factors than image-based methods.

Abstract: Motivation: High acceleration factors place a limit on MRI image reconstruction. This limit is extended to segmentation models when treating these as subsequent independent processes. Goal: Our goal is to produce segmentations directly from sparse k-space measurements without the need for intermediate image reconstruction. Approach: We employ a transformer architecture to encode global k-space information into latent features. The produced latent vectors condition queried coordinates during decoding to generate segmentation class probabilities. Results: The model is able to produce better segmentations across high acceleration factors than image-based segmentation baselines. Impact: Cardiac segmentation directly from undersampled k-space samples circumvents the need for an intermediate image reconstruction step. This allows the potential to assess myocardial structure and function on higher acceleration factors than methods that rely on images as input.
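For context on the input side, a standard Cartesian undersampling mask keeps only a fraction of the phase-encode lines. The center-band-plus-uniform pattern below is a common convention and an assumption here, not the paper's exact sampling scheme:

```python
def undersample_mask(num_lines, acceleration, num_center):
    # 1 = acquired phase-encode line, 0 = skipped. Every `acceleration`-th
    # line is kept, plus a fully sampled low-frequency center band.
    lo = num_lines // 2 - num_center // 2
    center = set(range(lo, lo + num_center))
    return [1 if (i % acceleration == 0 or i in center) else 0
            for i in range(num_lines)]

mask = undersample_mask(num_lines=16, acceleration=4, num_center=4)
effective_acceleration = len(mask) / sum(mask)  # net speed-up actually achieved
```

A model that segments directly from the masked k-space never has to invert this operator, which is what lets it keep working as the acceleration factor grows.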

[388] Computed Tomography (CT)-derived Cardiovascular Flow Estimation Using Physics-Informed Neural Networks Improves with Sinogram-based Training: A Simulation Study

Jinyuxuan Guo, Gurnoor Singh Khurana, Alejandro Gonzalo Grande, Juan C. del Alamo, Francisco Contijoch

Main category: eess.IV

TL;DR: SinoFlow uses sinogram data directly for CT-based blood flow estimation, outperforming traditional image-based methods by avoiding reconstruction errors and working better with various CT settings.

DetailsMotivation: CT imaging is widely used for cardiovascular assessment but lacks direct methods for blood flow velocity estimation from contrast evolution movies. Current approaches using reconstructed images introduce errors.

Method: Generated pulsatile flow in 2D vessel bifurcation using CFD, simulated CT scans with varying parameters (gantry speeds, tube currents, pulse modes), and compared PINN-based flow estimation using reconstructed images (ImageFlow) vs. direct sinogram data (SinoFlow).

Result: SinoFlow significantly improved flow estimation by avoiding filtered backprojection errors, performed robustly across all gantry speeds, produced lower MSE and velocity errors than ImageFlow, and worked well with pulsed-mode imaging with shorter pulse widths.

Conclusion: SinoFlow demonstrates strong potential for CT-based flow estimation, providing a more accurate non-invasive approach for blood flow assessment and informing future PINN applications to CT imaging.

Abstract: Background: Non-invasive imaging-based assessment of blood flow plays a critical role in evaluating heart function and structure. Computed Tomography (CT) is a widely-used imaging modality that can robustly evaluate cardiovascular anatomy and function, but direct methods to estimate blood flow velocity from movies of contrast evolution have not been developed. Purpose: This study evaluates the impact of CT imaging on Physics-Informed Neural Networks (PINN)-based flow estimation and proposes an improved framework, SinoFlow, which uses sinogram data directly to estimate blood flow. Methods: We generated pulsatile flow fields in an idealized 2D vessel bifurcation using computational fluid dynamics and simulated CT scans with varying gantry rotation speeds, tube currents, and pulse mode imaging settings. We compared the performance of PINN-based flow estimation using reconstructed images (ImageFlow) to SinoFlow. Results: SinoFlow significantly improved flow estimation performance by avoiding propagating errors introduced by filtered backprojection. SinoFlow was robust across all tested gantry rotation speeds and consistently produced lower mean squared error and velocity errors than ImageFlow. Additionally, SinoFlow was compatible with pulsed-mode imaging and maintained higher accuracy with shorter pulse widths. Conclusions: This study demonstrates the potential of SinoFlow for CT-based flow estimation, providing a more promising approach for non-invasive blood flow assessment. The findings aim to inform future applications of PINNs to CT images and provide a solution for image-based estimation, with reasonable acquisition parameters yielding accurate flow estimates.
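The physics side of a PINN loss can be illustrated with an incompressibility residual on a gridded velocity field; the finite-difference scheme and toy fields below are assumptions standing in for the paper's full flow physics and sinogram data terms:

```python
def divergence_residual(u, v, h=1.0):
    # Summed squared central-difference divergence of a 2D velocity field
    # (u, v) on a grid with spacing h, over interior points only.
    ny, nx = len(u), len(u[0])
    total = 0.0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            dudx = (u[j][i + 1] - u[j][i - 1]) / (2 * h)
            dvdy = (v[j + 1][i] - v[j - 1][i]) / (2 * h)
            total += (dudx + dvdy) ** 2
    return total

# A uniform flow is divergence-free; a linearly expanding u-field is not.
uniform_u = [[1.0] * 4 for _ in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
expanding_u = [[float(i) for i in range(4)] for _ in range(4)]
```

In SinoFlow, a residual of this kind would be combined with a data-fit term evaluated on the sinogram itself, rather than on filtered-backprojection images, which is where the reported accuracy gain comes from.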

[389] Shape Deformation Networks for Automated Aortic Valve Finite Element Meshing from 3D CT Images

Linchen Qian, Jiasong Chen, Ruonan Gong, Wei Sun, Minliang Liu, Liang Liang

Main category: eess.IV

TL;DR: A template-fitting pipeline with deep neural networks generates structured quadrilateral meshes from 3D CT images for aortic valve modeling, ensuring consistent correspondence and high-quality meshes across patients.

DetailsMotivation: Traditional approaches produce irregular triangular meshes with poor element quality and inconsistent correspondence due to anatomical variation, making biomechanical analysis and patient-specific simulations challenging.

Method: A template-fitting pipeline using deep neural networks with a common quad mesh template, employing a simplified loss function with only geometry reconstruction and smoothness regularization terms.

Result: The approach produces high-quality aortic valve surface meshes with improved smoothness and shape quality, requiring fewer explicit regularization terms than traditional methods.

Conclusion: Using structured quad meshes for templates and neural network training ensures mesh correspondence and quality while simplifying training, enhancing the effectiveness and efficiency of aortic valve modeling.

Abstract: Accurate geometric modeling of the aortic valve from 3D CT images is essential for biomechanical analysis and patient-specific simulations to assess valve health or make a preoperative plan. However, it remains challenging to generate aortic valve meshes with both high-quality and consistency across different patients. Traditional approaches often produce triangular meshes with irregular topologies, which can result in poorly shaped elements and inconsistent correspondence due to inter-patient anatomical variation. In this work, we address these challenges by introducing a template-fitting pipeline with deep neural networks to generate structured quad (i.e., quadrilateral) meshes from 3D CT images to represent aortic valve geometries. By remeshing aortic valves of all patients with a common quad mesh template, we ensure a uniform mesh topology with consistent node-to-node and element-to-element correspondence across patients. This consistency enables us to simplify the learning objective of the deep neural networks, by employing a loss function with only two terms (i.e., a geometry reconstruction term and a smoothness regularization term), which is sufficient to preserve mesh smoothness and element quality. Our experiments demonstrate that the proposed approach produces high-quality aortic valve surface meshes with improved smoothness and shape quality, while requiring fewer explicit regularization terms compared to the traditional methods. These results highlight that using structured quad meshes for the template and neural network training not only ensures mesh correspondence and quality but also simplifies the training process, thus enhancing the effectiveness and efficiency of aortic valve modeling.
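The two-term objective described above (geometry reconstruction plus smoothness regularization) can be written down directly; the node layout, neighbor table, and weight below are toy assumptions for a tiny strip of the template:

```python
def mesh_loss(nodes, targets, neighbors, w_smooth=0.1):
    # Geometry reconstruction term: squared distance to target node positions.
    recon = sum((a - b) ** 2
                for node, tgt in zip(nodes, targets)
                for a, b in zip(node, tgt))
    # Laplacian smoothness term: deviation of each node from its neighbor mean.
    smooth = 0.0
    for i, nbrs in neighbors.items():
        for d in range(len(nodes[i])):
            mean_d = sum(nodes[k][d] for k in nbrs) / len(nbrs)
            smooth += (nodes[i][d] - mean_d) ** 2
    return recon + w_smooth * smooth

# Toy 3-node strip standing in for the quad template's fixed connectivity.
nodes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
targets = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
neighbors = {1: [0, 2]}
loss = mesh_loss(nodes, targets, neighbors)
```

Because the template's connectivity is shared across patients, this is the entire training objective; no element-quality or correspondence penalties are needed.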

[390] DeepFixel: Crossing white matter fiber identification through spherical convolutional neural networks

Adam M. Saunders, Lucas W. Remedios, Elyssa M. McMaster, Jongyeon Yoon, Gaurav Rudravaram, Adam Sadriddinov, Praitayini Kanakaraj, Bennett A. Landman, Adam W. Anderson

Main category: eess.IV

TL;DR: DeepFixel is a spherical CNN that efficiently separates crossing white matter fibers in diffusion MRI, outperforming traditional optimization methods in speed while maintaining high accuracy.

DetailsMotivation: Crossing white matter fibers in voxels complicate diffusion MRI analysis and cause errors in downstream tasks like tractography, requiring efficient separation methods.

Method: Uses spherical convolutional neural network to approximate nonlinear optimization, modeling fiber probability distribution as spherical mesh with high angular resolution.

Result: Achieves median angular correlation coefficient of 0.973 (vs 1.0 for optimization and 0.988 for fixel-based method) with 0.32 ms per voxel computation time.

Conclusion: DeepFixel successfully disentangles fibers at smaller angular separations and volume fractions than fixel-based methods while being computationally efficient.

Abstract: Diffusion-weighted magnetic resonance imaging allows for reconstruction of models for structural connectivity in the brain, such as fiber orientation distribution functions (ODFs) that describe the distribution, direction, and volume of white matter fiber bundles in a voxel. Crossing white matter fibers in voxels complicate analysis and can lead to errors in downstream tasks like tractography. We introduce one option for separating fiber ODFs by performing a nonlinear optimization to fit ODFs to the given data and penalizing terms that are not symmetric about the axis of the fiber. However, this optimization is non-convex and computationally infeasible across an entire image (approximately 1.01 × 10^6 ms per voxel). We introduce DeepFixel, a spherical convolutional neural network approximation for this nonlinear optimization. We model the probability distribution of fibers as a spherical mesh with higher angular resolution than a truncated spherical harmonic representation. To validate DeepFixel, we compare to the nonlinear optimization and a fixel-based separation algorithm of two-fiber and three-fiber ODFs. The median angular correlation coefficient is 1 (interquartile range of 0.00) using the nonlinear optimization algorithm, 0.988 (0.317) using a fiber bundle elements or “fixel”-based separation algorithm, and 0.973 (0.004) using DeepFixel. DeepFixel is more computationally efficient than the non-convex optimization (0.32 ms per voxel). DeepFixel’s spherical mesh representation is successful at disentangling at smaller angular separations and smaller volume fractions than the fixel-based separation algorithm.
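The validation metric, the angular correlation coefficient, reduces to a Pearson correlation over matched samples of the two ODFs on the sphere. This is a simplified sample-based form (the literature's definition works on spherical harmonic coefficients), and the sample values are toy assumptions:

```python
def angular_correlation(f, g):
    # Pearson correlation between two ODFs evaluated on the same set of
    # unit directions; 1.0 means identical angular structure up to scale.
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    num = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    den = (sum((a - mf) ** 2 for a in f) * sum((b - mg) ** 2 for b in g)) ** 0.5
    return num / den
```

Scores near 1 mean the separated single-fiber ODFs match the reference; the paper reports median values of 1, 0.988, and 0.973 for the three methods under this metric.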

[391] $μ$NeuFMT: Optical-Property-Adaptive Fluorescence Molecular Tomography via Implicit Neural Representation

Shihan Zhao, Jianru Zhang, Yanan Wu, Linlin Li, Siyuan Shen, Xingjun Zhu, Guoyan Zheng, Jiahua Jiang, Wuwei Ren

Main category: eess.IV

TL;DR: μNeuFMT is a self-supervised FMT reconstruction framework that jointly optimizes fluorescence distribution and optical properties, eliminating need for precise prior knowledge of tissue optics.

DetailsMotivation: FMT reconstruction is challenging due to ill-posedness and reliance on inaccurate/unknown tissue optical properties. Supervised deep learning methods have limited generalization beyond training data.

Method: Integrates implicit neural-based scene representation with explicit physical modeling of photon propagation. Jointly optimizes both fluorescence distribution and optical properties (μ) during reconstruction.

Result: Robustly recovers accurate fluorophore distributions and optical coefficients even with severely erroneous initial values (0.5× to 2× of ground truth). Outperforms conventional and supervised deep learning approaches across diverse heterogeneous scenarios.

Conclusion: Establishes a new paradigm for robust and accurate FMT reconstruction, paving the way for more reliable molecular imaging in complex clinically related scenarios like fluorescence guided surgery.

Abstract: Fluorescence Molecular Tomography (FMT) is a promising technique for non-invasive 3D visualization of fluorescent probes, but its reconstruction remains challenging due to the inherent ill-posedness and reliance on inaccurate or often-unknown tissue optical properties. While deep learning methods have shown promise, their supervised nature limits generalization beyond training data. To address these problems, we propose $\mu$NeuFMT, a self-supervised FMT reconstruction framework that integrates implicit neural-based scene representation with explicit physical modeling of photon propagation. Its key innovation lies in jointly optimizing both the fluorescence distribution and the optical properties ($\mu$) during reconstruction, eliminating the need for precise prior knowledge of tissue optics or pre-conditioned training data. We demonstrate that $\mu$NeuFMT robustly recovers accurate fluorophore distributions and optical coefficients even with severely erroneous initial values (0.5$\times$ to 2$\times$ of ground truth). Extensive numerical, phantom, and in vivo validations show that $\mu$NeuFMT outperforms conventional and supervised deep learning approaches across diverse heterogeneous scenarios. Our work establishes a new paradigm for robust and accurate FMT reconstruction, paving the way for more reliable molecular imaging in complex clinically related scenarios, such as fluorescence guided surgery.
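The joint-optimization idea, updating the unknown source and the optical coefficient together from the same loss, can be shown on a one-dimensional Beer-Lambert toy model (my stand-in, not the paper's photon-propagation physics). As in the paper's robustness test, $\mu$ starts at 2x its true value:

```python
import math

# Measurements y_i = x_true * exp(-mu_true * d_i): a 1-D attenuation stand-in
# for the paper's physical forward model (purely illustrative).
d = [0.0, 1.0, 2.0, 3.0]
x_true, mu_true = 2.0, 0.5
y = [x_true * math.exp(-mu_true * di) for di in d]

# Gradient descent on BOTH unknowns from the same data-fit loss,
# starting mu at twice its true value.
x_fit, mu_fit, lr = 1.0, 1.0, 0.05
for _ in range(20000):
    gx = gmu = 0.0
    for di, yi in zip(d, y):
        e = math.exp(-mu_fit * di)
        r = x_fit * e - yi
        gx += 2 * r * e
        gmu += 2 * r * x_fit * (-di) * e
    x_fit -= lr * gx
    mu_fit -= lr * gmu
```

Both unknowns recover their true values because the measurements constrain them jointly; $\mu$NeuFMT does the analogous thing with an implicit neural scene representation in place of the scalar source.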

[392] X-Diffusion: Generating Detailed 3D MRI Volumes From a Single Image Using Cross-Sectional Diffusion Models

Emmanuelle Bourigault, Abdullah Hamdi, Amir Jamaludin

Main category: eess.IV

TL;DR: X-Diffusion is a cross-sectional diffusion model that reconstructs detailed 3D MRI volumes from extremely sparse 2D slice inputs, enabling 2D-to-3D reconstruction from as little as a single slice.

DetailsMotivation: Traditional MRI reconstruction requires full 3D scans, making high-resolution imaging slow and expensive. The goal is to accelerate MRI acquisition by reconstructing complete 3D volumes from minimal 2D input data.

Method: X-Diffusion models MRI data as holistic 3D volumes during cross-sectional training and inference, unlike previous approaches that treat scans as collections of 2D slices. It uses a diffusion model framework for 2D-to-3D reconstruction.

Result: X-Diffusion surpasses state-of-the-art methods in quantitative accuracy (PSNR), preserves critical anatomical features, and generalizes beyond training domains (e.g., reconstructing knee MRIs despite brain-only training). Medical expert evaluations confirm clinical relevance.

Conclusion: X-Diffusion is the first method capable of producing detailed 3D MRIs from highly limited 2D input data, potentially accelerating MRI acquisition and reducing costs while maintaining diagnostic quality.

Abstract: Magnetic Resonance Imaging (MRI) is a crucial diagnostic tool, but high-resolution scans are often slow and expensive due to extensive data acquisition requirements. Traditional MRI reconstruction methods aim to expedite this process by filling in missing frequency components in the K-space, performing 3D-to-3D reconstructions that demand full 3D scans. In contrast, we introduce X-Diffusion, a novel cross-sectional diffusion model that reconstructs detailed 3D MRI volumes from extremely sparse spatial-domain inputs, achieving 2D-to-3D reconstruction from as little as a single 2D MRI slice or few slices. A key aspect of X-Diffusion is that it models MRI data as holistic 3D volumes during the cross-sectional training and inference, unlike previous learning approaches that treat MRI scans as collections of 2D slices in standard planes (coronal, axial, sagittal). We evaluated X-Diffusion on brain tumor MRIs from the BRATS dataset and full-body MRIs from the UK Biobank dataset. Our results demonstrate that X-Diffusion not only surpasses state-of-the-art methods in quantitative accuracy (PSNR) on unseen data but also preserves critical anatomical features such as tumor profiles, spine curvature, and brain volume. Remarkably, the model generalizes beyond the training domain, successfully reconstructing knee MRIs despite being trained exclusively on brain data. Medical expert evaluations further confirm the clinical relevance and fidelity of the generated images. To our knowledge, X-Diffusion is the first method capable of producing detailed 3D MRIs from highly limited 2D input data, potentially accelerating MRI acquisition and reducing associated costs. The code is available on the project website: https://emmanuelleb985.github.io/XDiffusion/
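PSNR, the quantitative metric used above, is defined from the mean squared error against a reference volume; a minimal flattened-array version:

```python
import math

def psnr(reference, test, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher is better, inf for a perfect match.
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / mse)
```

For example, `psnr([0.0, 1.0], [0.0, 0.9])` is about 23 dB; in practice the sums run over every voxel of the reconstructed 3D volume.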

[393] Evaluating and Improving the Effectiveness of Synthetic Chest X-Rays for Medical Image Analysis

Eva Prakash, Jeya Maria Jose Valanarasu, Zhihong Chen, Eduardo Pontes Reis, Andrew Johnston, Anuj Pareek, Christian Bluethgen, Sergios Gatidis, Cameron Olsen, Akshay Chaudhari, Andrew Ng, Curtis Langlotz

Main category: eess.IV

TL;DR: Using latent diffusion models to generate synthetic chest X-rays improves deep learning model performance for classification and segmentation tasks when combined with real data.

DetailsMotivation: To explore best-practice approaches for generating synthetic chest X-ray images to augment medical imaging datasets and optimize deep learning model performance in downstream tasks.

Method: Used latent diffusion models conditioned on text prompts and/or segmentation masks, with methods like proxy models and radiologist feedback to improve synthetic data quality. Synthetic images were generated from disease information or geometrically transformed masks and added to real datasets (CheXpert, CANDID-PTX, SIIM, RSNA Pneumonia) to measure performance improvements.

Result: Synthetic data resulted in maximum mean classification F1 score improvement of 0.150453 (P=0.0031) and maximum Dice score improvement of 0.14575 (P=0.0064) compared to using only real data.

Conclusion: Best practices include conditioning on single-disease labels or geometrically transformed segmentation masks, and potentially using proxy modeling for fine-tuning generations.

Abstract: Purpose: To explore best-practice approaches for generating synthetic chest X-ray images and augmenting medical imaging datasets to optimize the performance of deep learning models in downstream tasks like classification and segmentation. Materials and Methods: We utilized a latent diffusion model to condition the generation of synthetic chest X-rays on text prompts and/or segmentation masks. We explored methods like using a proxy model and using radiologist feedback to improve the quality of synthetic data. These synthetic images were then generated from relevant disease information or geometrically transformed segmentation masks and added to ground truth training set images from the CheXpert, CANDID-PTX, SIIM, and RSNA Pneumonia datasets to measure improvements in classification and segmentation model performance on the test sets. F1 and Dice scores were used to evaluate classification and segmentation respectively. One-tailed t-tests with Bonferroni correction assessed the statistical significance of performance improvements with synthetic data. Results: Across all experiments, the synthetic data we generated resulted in a maximum mean classification F1 score improvement of 0.150453 (CI: 0.099108-0.201798; P=0.0031) compared to using only real data. For segmentation, the maximum Dice score improvement was 0.14575 (CI: 0.108267-0.183233; P=0.0064). Conclusion: Best practices for generating synthetic chest X-ray images for downstream tasks include conditioning on single-disease labels or geometrically transformed segmentation masks, as well as potentially using proxy modeling for fine-tuning such generations.
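The significance testing in the Results can be reproduced mechanically: Bonferroni correction multiplies each raw p-value by the number of comparisons before testing against the significance level. The first two p-values below reuse the paper's reported 0.0031 and 0.0064; the third is a hypothetical extra comparison:

```python
def bonferroni(p_values, alpha=0.05):
    # Bonferroni-adjusted p-values and significance decisions at level alpha.
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p_adj < alpha for p_adj in adjusted]

adjusted, significant = bonferroni([0.0031, 0.0064, 0.03])
```

The first two results remain significant after correction; the hypothetical third one does not, illustrating how the correction guards against false positives across multiple experiments.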

[394] TraceTrans: Translation and Spatial Tracing for Surgical Prediction

Xiyu Luo, Haodong Li, Xinxing Cheng, He Zhao, Yang Hu, Xuan Song, Tianyang Zhang

Main category: eess.IV

TL;DR: TraceTrans is a deformable image translation model for post-operative prediction that generates anatomically consistent images while revealing spatial correspondences with pre-operative inputs.

DetailsMotivation: Existing image translation methods focus on matching target distributions but neglect spatial correspondences, leading to structural inconsistencies and hallucinations that undermine reliability in clinical applications requiring anatomical accuracy.

Method: Uses an encoder for feature extraction and dual decoders - one for predicting spatial deformations and another for synthesizing the translated image. The predicted deformation field imposes spatial constraints to ensure anatomical consistency.

Result: Extensive experiments on medical cosmetology and brain MRI datasets show that TraceTrans delivers accurate and interpretable post-operative predictions.

Conclusion: TraceTrans demonstrates potential for reliable clinical deployment by providing anatomically consistent and interpretable predictions through explicit spatial correspondence modeling.

Abstract: Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
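The predicted deformation field can be understood as a backward warp that pulls each output pixel from a source location in the pre-operative image, making the spatial correspondence explicit. Below is a nearest-neighbor, integer-displacement version (real models use differentiable bilinear sampling, and the toy image and field are assumptions):

```python
def warp(image, dx, dy):
    # Backward warp of a 2D image by per-pixel integer displacement fields
    # (dx, dy), clamping source coordinates to the image bounds.
    ny, nx = len(image), len(image[0])
    out = [[0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            src_i = min(max(i + dx[j][i], 0), nx - 1)
            src_j = min(max(j + dy[j][i], 0), ny - 1)
            out[j][i] = image[src_j][src_i]
    return out

image = [[1, 2], [3, 4]]
shifted = warp(image, dx=[[1, 1], [1, 1]], dy=[[0, 0], [0, 0]])
```

Constraining the synthesized image to agree with such a warp of the input is what keeps the prediction anatomically consistent instead of hallucinated.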

Last updated: 2025-11-28
Built with Hugo, theme modified from Stack